跳到主要内容

2025-05-13-12-09

A Grounded Memory System For Smart Personal Assistants

Abstract

arXiv:2505.06328v1 Announce Type: new Abstract: A wide variety of agentic AI applications - ranging from cognitive assistants for dementia patients to robotics - demand a robust memory system grounded in reality. In this paper, we propose such a memory system consisting of three components. First, we combine Vision Language Models for image captioning and entity disambiguation with Large Language Models for consistent information extraction during perception. Second, the extracted information is represented in a memory consisting of a knowledge graph enhanced by vector embeddings to efficiently manage relational information. Third, we combine semantic search and graph query generation for question answering via Retrieval Augmented Generation. We illustrate the system's working and potential using a real-world example.

摘要

从面向痴呆患者的认知辅助到机器人技术等各类智能体AI应用,都需要一个基于现实的鲁棒记忆系统。本文提出了一种由三个组件构成的记忆系统:首先,我们结合视觉语言模型(用于图像描述和实体消歧)与大语言模型(用于感知过程中保持信息提取的一致性);其次,将提取的信息存储于由向量嵌入增强的知识图谱内存中,以高效管理关系型信息;最后,我们通过检索增强生成技术,整合语义搜索与图谱查询生成来实现问答功能。通过真实案例展示了该系统的工作原理及应用潜力。


KCluster: An LLM-based Clustering Approach to Knowledge Component Discovery

Abstract

arXiv:2505.06469v1 Announce Type: new Abstract: Educators evaluate student knowledge using knowledge component (KC) models that map assessment questions to KCs. Still, designing KC models for large question banks remains an insurmountable challenge for instructors who need to analyze each question by hand. The growing use of Generative AI in education is expected only to aggravate this chronic deficiency of expert-designed KC models, as course engineers designing KCs struggle to keep up with the pace at which questions are generated. In this work, we propose KCluster, a novel KC discovery algorithm based on identifying clusters of congruent questions according to a new similarity metric induced by a large language model (LLM). We demonstrate in three datasets that an LLM can create an effective metric of question similarity, which a clustering algorithm can use to create KC models from questions with minimal human effort. Combining the strengths of LLM and clustering, KCluster generates descriptive KC labels and discovers KC models that predict student performance better than the best expert-designed models available. In anticipation of future work, we illustrate how KCluster can reveal insights into difficult KCs and suggest improvements to instruction.

摘要

教育工作者通过将评估问题映射到知识组件(KC)的知识组件模型来评估学生知识。然而,为大型题库设计KC模型对需要手动分析每个问题的教师而言仍是难以克服的挑战。教育领域生成式AI的日益普及预计只会加剧这种专家设计KC模型的长期不足,因为课程设计者难以跟上问题生成的速度。本研究提出KCluster——一种基于大型语言模型(LLM)诱导的新相似性度量来识别一致性问题簇的新型KC发现算法。我们在三个数据集中证明,LLM可以创建有效的问题相似性度量,聚类算法可借此以最小人力投入从问题中创建KC模型。结合LLM与聚类的优势,KCluster能生成描述性KC标签,并发现比现有最佳专家设计模型更能预测学生表现的KC模型。针对未来研究,我们展示了KCluster如何揭示困难KC的洞见,并为教学改进提供建议。


A New DAPO Algorithm for Stock Trading

Abstract

arXiv:2505.06408v1 Announce Type: new Abstract: Recent advances in reinforcement learning, such as Dynamic Sampling Policy Optimization (DAPO), show strong performance when paired with large language models (LLMs). Motivated by this success, we ask whether similar gains can be realized in financial trading. We design a trading agent that combines an improved Group Relative Policy Optimization (GRPO) algorithm, augmented with ideas from DAPO, with LLM-based risk and sentiment signals extracted from financial news. On the NASDAQ-100 index (FNSPID dataset), our agent attains a cumulative return of 230.49 percent and an information ratio of 0.37, outperforming the CPPO-DeepSeek baseline. It also cuts training time from about 8 hours to 2.5 hours over 100 epochs while markedly reducing RAM usage. The proposed RL-LLM framework offers a scalable path toward data-efficient trading agents. Code: https://github.com/Ruijian-Zha/FinRL-DAPO-SR/

摘要

强化学习领域的最新进展,如动态采样策略优化(DAPO),在与大语言模型(LLM)结合时展现出强劲性能。受此成功启发,我们探讨类似优势能否在金融交易中实现。我们设计了一个交易智能体,将改进的组相对策略优化(GRPO)算法(融合了DAPO的思想)与基于LLM的金融新闻风险及情绪信号相结合。在纳斯达克100指数(FNSPID数据集)上,该智能体累计收益率达230.49%,信息比率为0.37,优于CPPO-DeepSeek基线模型。同时,在100个训练周期内将训练时间从约8小时缩短至2.5小时,并显著降低内存占用。所提出的RL-LLM框架为构建数据高效型交易智能体提供了可扩展路径。


Challenging GPU Dominance: When CPUs Outperform for On-Device LLM Inference

Abstract

arXiv:2505.06461v1 Announce Type: new Abstract: The common assumption in on-device AI is that GPUs, with their superior parallel processing, always provide the best performance for large language model (LLM) inference. In this work, we challenge this notion by empirically demonstrating that, under certain conditions, CPUs can outperform GPUs for LLM inference on mobile devices. Using a 1-billion-parameter LLM deployed via llama.cpp on the iPhone 15 Pro, we show that a CPU-only configuration (two threads, F16 precision) achieves 17 tokens per second, surpassing the 12.8 tokens per second obtained with GPU acceleration. We analyze the architectural factors driving this counterintuitive result, revealing that GPU memory transfer overhead and CPU thread optimization play a critical role. Furthermore, we explore the impact of thread oversubscription, quantization strategies, and hardware constraints, providing new insights into efficient on-device AI execution. Our findings challenge conventional GPU-first thinking, highlighting the untapped potential of optimized CPU inference and paving the way for smarter deployment strategies in mobile AI. However, fully explaining the observed CPU advantage remains difficult due to limited access to low-level profiling tools on iOS.

摘要

设备端人工智能的普遍假设认为,凭借其卓越的并行处理能力,GPU始终能为大语言模型(LLM)推理提供最佳性能。本研究通过实证分析挑战了这一观点,发现在特定条件下,移动设备上CPU的LLM推理性能可超越GPU。我们在iPhone 15 Pro上通过llama.cpp部署10亿参数LLM,证明纯CPU配置(双线程,F16精度)可实现每秒17个token,优于GPU加速获得的12.8个token/秒。通过分析驱动这一反直觉结果的架构因素,我们发现GPU内存传输开销和CPU线程优化起关键作用。进一步探究了线程过载、量化策略和硬件限制的影响,为设备端AI高效执行提供了新见解。这些发现挑战了传统的GPU优先思维,揭示了优化CPU推理的未开发潜力,为移动AI部署策略开辟了新路径。然而,由于iOS底层分析工具的访问限制,完全解释观察到的CPU优势仍存在困难。


Towards Efficient LLM Storage Reduction via Tensor Deduplication and Delta Compression

Abstract

arXiv:2505.06252v1 Announce Type: new Abstract: Modern model hubs, such as Hugging Face, store tens of petabytes of LLMs, with fine-tuned variants vastly outnumbering base models and dominating storage consumption. Existing storage reduction techniques -- such as deduplication and compression -- are either LLM oblivious or not compatible with each other, limiting data reduction effectiveness. Our large-scale characterization study across all publicly available Hugging Face LLM repositories reveals several key insights: (1) fine-tuned models within the same family exhibit highly structured, sparse parameter differences suitable for delta compression; (2) bitwise similarity enables LLM family clustering; and (3) tensor-level deduplication offers strong synergy with model aware compressors. Building on these insights, we present BitX, an effective, fast, lossless delta compression algorithm that compresses XORed redundancy between fine-tuned and base LLMs. We build zLLM, a model storage reduction pipeline that unifies tensor-level deduplication and lossless BitX compression. By synergizing deduplication and compression around LLM family clustering, zLLM reduces model storage consumption by 49.5 percent, over 20 percent more than state-of-the-art deduplication and compression designs.

摘要

现代模型中心(如Hugging Face)存储着数十PB量级的大语言模型(LLM),其中微调变体的数量远超基础模型并主导存储消耗。现有存储缩减技术(如去重和压缩)要么未针对LLM优化,要么彼此不兼容,限制了数据缩减效果。我们对Hugging Face所有公开LLM仓库的大规模特征分析揭示了关键发现:(1) 同一家族的微调模型呈现高度结构化、稀疏的参数差异,适合采用增量压缩;(2) 比特级相似性支持LLM家族聚类;(3) 张量级去重与模型感知压缩器具有强协同效应。基于这些发现,我们提出BitX算法——一种高效、快速、无损的增量压缩方法,专门压缩微调LLM与基础模型间的XOR冗余。我们构建了zLLM存储缩减管道,统一整合张量级去重与无损BitX压缩。通过围绕LLM家族聚类协同去重与压缩,zLLM将模型存储消耗降低49.5%,较现有最优去重与压缩方案提升超20个百分点。


Text-to-CadQuery: A New Paradigm for CAD Generation with Scalable Large Model Capabilities

Abstract

arXiv:2505.06507v1 Announce Type: new Abstract: Computer-aided design (CAD) is fundamental to modern engineering and manufacturing, but creating CAD models still requires expert knowledge and specialized software. Recent advances in large language models (LLMs) open up the possibility of generative CAD, where natural language is directly translated into parametric 3D models. However, most existing methods generate task-specific command sequences that pretrained models cannot directly handle. These sequences must be converted into CAD representations such as CAD vectors before a 3D model can be produced, which requires training models from scratch and adds unnecessary complexity. To tackle this issue, we propose generating CadQuery code directly from text, leveraging the strengths of pretrained LLMs to produce 3D models without intermediate representations, using this Python-based scripting language. Since LLMs already excel at Python generation and spatial reasoning, fine-tuning them on Text-to-CadQuery data proves highly effective. Given that these capabilities typically improve with scale, we hypothesize that larger models will perform better after fine-tuning. To enable this, we augment the Text2CAD dataset with 170,000 CadQuery annotations. We fine-tune six open-source LLMs of varying sizes and observe consistent improvements. Our best model achieves a top-1 exact match of 69.3%, up from 58.8%, and reduces Chamfer Distance by 48.6%. Project page: https://github.com/Text-to-CadQuery/Text-to-CadQuery.

摘要

计算机辅助设计(CAD)是现代工程与制造的基础,但创建CAD模型仍需专业知识与专用软件。大型语言模型(LLM)的最新进展为生成式CAD开辟了新途径——将自然语言直接转换为参数化3D模型。然而现有方法多生成特定任务指令序列,预训练模型无法直接处理,必须将序列转换为CAD向量等中间表示才能生成3D模型,这不仅需从头训练模型,还引入冗余复杂度。为解决该问题,我们提出通过基于Python的脚本语言CadQuery直接从文本生成代码,利用预训练LLM优势跳过中间表示环节生成3D模型。由于LLM本身具备优秀的Python生成与空间推理能力,在文本到CadQuery数据上的微调效果显著。鉴于这些能力通常随模型规模提升而增强,我们假设更大模型经微调后表现更优。为此,我们在Text2CAD数据集中新增17万条CadQuery标注,并对6个不同规模的开源LLM进行微调,结果均显示持续改进。最佳模型的top-1精确匹配率从58.8%提升至69.3%,倒角距离降低48.6%。项目页面:https://github.com/Text-to-CadQuery/Text-to-CadQuery。


Reliable Collaborative Conversational Agent System Based on LLMs and Answer Set Programming

Abstract

arXiv:2505.06438v1 Announce Type: new Abstract: As the Large-Language-Model-driven (LLM-driven) Artificial Intelligence (AI) bots became popular, people realized their strong potential in Task-Oriented Dialogue (TOD). However, bots relying wholly on LLMs are unreliable in their knowledge, and whether they can finally produce a correct result for the task is not guaranteed. The collaboration among these agents also remains a challenge, since the necessary information to convey is unclear, and the information transfer is by prompts -- unreliable, and malicious knowledge is easy to inject. With the help of logic programming tools such as Answer Set Programming (ASP), conversational agents can be built safely and reliably, and communication among the agents made more efficient and secure. We proposed an Administrator-Assistant Dual-Agent paradigm, where the two ASP-driven bots share the same knowledge base and complete their tasks independently, while the information can be passed by a Collaborative Rule Set (CRS). The knowledge and information conveyed are encapsulated and invisible to the users, ensuring the security of information transmission. We have constructed AutoManager, a dual-agent system for managing the drive-through window of a fast-food restaurant such as Taco Bell in the US. In AutoManager, the assistant bot takes the customer's order while the administrator bot manages the menu and food supply. We evaluated our AutoManager and compared it with the real-world Taco Bell Drive-Thru AI Order Taker, and the results show that our method is more reliable.

摘要

随着大语言模型驱动(LLM-driven)的人工智能(AI)机器人日益普及,人们认识到其在任务导向对话(TOD)中的强大潜力。然而,完全依赖大语言模型的机器人在知识可靠性方面存在缺陷,其能否最终为任务生成正确结果尚未可知。这些智能体之间的协作仍面临挑战,因为需传递的必要信息不明确,且信息传输通过提示词(prompts)实现——这种方式不可靠且易被注入恶意知识。借助答案集编程(ASP)等逻辑编程工具,可以安全可靠地构建对话代理,并使智能体间的通信更高效、更安全。我们提出了一种"管理员-助手"双代理范式,其中两个ASP驱动的机器人共享相同知识库并独立完成任务,同时通过协作规则集(CRS)传递信息。所传递的知识和信息经过封装,对用户不可见,从而确保信息传输的安全性。我们构建了AutoManager系统,这是一个用于管理美国塔可钟等快餐店汽车餐厅窗口的双代理系统。在AutoManager中,助手机器人负责接收顾客订单,而管理员机器人则管理菜单和食品供应。通过将AutoManager与真实世界的塔可钟汽车餐厅AI订单接收系统进行比较评估,结果表明我们的方法更具可靠性。


Control Plane as a Tool: A Scalable Design Pattern for Agentic AI Systems

Abstract

arXiv:2505.06817v1 Announce Type: new Abstract: Agentic AI systems represent a new frontier in artificial intelligence, where agents often based on large language models(LLMs) interact with tools, environments, and other agents to accomplish tasks with a degree of autonomy. These systems show promise across a range of domains, but their architectural underpinnings remain immature. This paper conducts a comprehensive review of the types of agents, their modes of interaction with the environment, and the infrastructural and architectural challenges that emerge. We identify a gap in how these systems manage tool orchestration at scale and propose a reusable design abstraction: the "Control Plane as a Tool" pattern. This pattern allows developers to expose a single tool interface to an agent while encapsulating modular tool routing logic behind it. We position this pattern within the broader context of agent design and argue that it addresses several key challenges in scaling, safety, and extensibility.

摘要

代理式人工智能系统代表了人工智能的新前沿,这类通常基于大语言模型(LLMs)的代理通过与工具、环境及其他代理的交互,以一定自主性完成任务。这些系统在多个领域展现出潜力,但其架构基础仍不成熟。本文系统梳理了代理类型、其与环境交互的模式,以及由此产生的基础设施与架构挑战。我们发现这些系统在大规模工具编排管理方面存在空白,并提出一种可复用的设计抽象:'控制平面即工具'模式。该模式允许开发者向代理暴露单一工具接口,同时在其背后封装模块化的工具路由逻辑。我们将此模式置于更广泛的代理设计语境中,论证其能有效解决扩展性、安全性和可扩展性方面的若干关键挑战。


AI-CDA4All: Democratizing Cooperative Autonomous Driving for All Drivers via Affordable Dash-cam Hardware and Open-source AI Software

Abstract

arXiv:2505.06749v1 Announce Type: new Abstract: As transportation technology advances, the demand for connected vehicle infrastructure has greatly increased to improve their efficiency and safety. One area of advancement, Cooperative Driving Automation (CDA) still relies on expensive autonomy sensors or connectivity units and are not interoperable across existing market car makes/models, limiting its scalability on public roads. To fill these gaps, this paper presents a novel approach to democratizing CDA technology, it leverages low-cost, commercially available edge devices such as vehicle dash-cams and open-source software to make the technology accessible and scalable to be used in transportation infrastructure and broader public domains. This study also investigates the feasibility of utilizing cost-effective communication protocols based on LTE and WiFi. These technologies enable lightweight Vehicle-to-Everything (V2X) communications, facilitating real-time data exchange between vehicles and infrastructure. Our research and development efforts are aligned with industrial standards to ensure compatibility and future integration into existing transportation ecosystems. By prioritizing infrastructure-oriented applications, such as improved traffic flow management, this approach seeks to deliver tangible societal benefits without directly competing with vehicle OEMs. As recent advancement of Generative AI (GenAI), there is no standardized integration of GenAI technologies into open-source CDAs, as the current trends of muiltimodal large language models gain popularity, we demonstrated a feasible locally deployed edge LLM models can enhance driving experience while preserving privacy and security compared to cloud-connected solutions. The proposed system underscores the potential of low-cost, scalable solutions in advancing CDA functionality, paving the way for smarter, safer, and more inclusive transportation networks.

摘要

随着交通技术的进步,为提高效率与安全性,对网联车辆基础设施的需求大幅增长。协同驾驶自动化(CDA)作为重要发展方向,目前仍依赖昂贵的自动驾驶传感器或网联单元,且无法跨现有市场车型实现互操作,限制了其在公共道路的可扩展性。为填补这些空白,本文提出一种普及CDA技术的新方法:通过商用低成本边缘设备(如车载摄像头)和开源软件,使该技术具备可及性与可扩展性,适用于交通基础设施及更广泛的公共领域。本研究还探讨了基于LTE和WiFi的经济型通信协议的可行性,这些技术支持轻量级车联网(V2X)通信,实现车辆与基础设施间的实时数据交换。我们的研发工作遵循工业标准,以确保兼容性及未来与现有交通系统的整合。通过优先发展以基础设施为导向的应用(如改进交通流管理),该方案可在不与整车厂直接竞争的前提下提供切实的社会效益。鉴于生成式人工智能(GenAI)的最新进展,目前开源CDA尚缺乏GenAI技术的标准化集成。随着多模态大语言模型的流行趋势,我们验证了本地部署的边缘大语言模型在提升驾驶体验方面的可行性,相比云端方案更能保障隐私与安全性。本系统彰显了低成本可扩展方案在推进CDA功能方面的潜力,为构建更智能、更安全、更包容的交通网络铺平道路。


LLM-Augmented Chemical Synthesis and Design Decision Programs

Abstract

arXiv:2505.07027v1 Announce Type: new Abstract: Retrosynthesis, the process of breaking down a target molecule into simpler precursors through a series of valid reactions, stands at the core of organic chemistry and drug development. Although recent machine learning (ML) research has advanced single-step retrosynthetic modeling and subsequent route searches, these solutions remain restricted by the extensive combinatorial space of possible pathways. Concurrently, large language models (LLMs) have exhibited remarkable chemical knowledge, hinting at their potential to tackle complex decision-making tasks in chemistry. In this work, we explore whether LLMs can successfully navigate the highly constrained, multi-step retrosynthesis planning problem. We introduce an efficient scheme for encoding reaction pathways and present a new route-level search strategy, moving beyond the conventional step-by-step reactant prediction. Through comprehensive evaluations, we show that our LLM-augmented approach excels at retrosynthesis planning and extends naturally to the broader challenge of synthesizable molecular design.

摘要

逆合成分析是通过一系列有效反应将目标分子分解为更简单前体的过程,是有机化学和药物研发的核心环节。尽管近期机器学习研究在单步逆合成建模及后续路径搜索方面取得进展,这些解决方案仍受限于庞大可能的路径组合空间。与此同时,大型语言模型展现出卓越的化学知识储备,暗示其有望解决化学领域的复杂决策任务。本研究探讨大型语言模型能否成功应对高度受限的多步逆合成规划问题。我们提出一种高效的反应路径编码方案,并开发新型路线级搜索策略,突破传统逐步反应物预测的局限。综合评估表明,我们的语言模型增强方法在逆合成规划中表现优异,并能自然扩展到可合成分子设计这一更广泛的挑战领域。


Towards Artificial General or Personalized Intelligence? A Survey on Foundation Models for Personalized Federated Intelligence

Abstract

arXiv:2505.06907v1 Announce Type: new Abstract: The rise of large language models (LLMs), such as ChatGPT, DeepSeek, and Grok-3, has reshaped the artificial intelligence landscape. As prominent examples of foundational models (FMs) built on LLMs, these models exhibit remarkable capabilities in generating human-like content, bringing us closer to achieving artificial general intelligence (AGI). However, their large-scale nature, sensitivity to privacy concerns, and substantial computational demands present significant challenges to personalized customization for end users. To bridge this gap, this paper presents the vision of artificial personalized intelligence (API), focusing on adapting these powerful models to meet the specific needs and preferences of users while maintaining privacy and efficiency. Specifically, this paper proposes personalized federated intelligence (PFI), which integrates the privacy-preserving advantages of federated learning (FL) with the zero-shot generalization capabilities of FMs, enabling personalized, efficient, and privacy-protective deployment at the edge. We first review recent advances in both FL and FMs, and discuss the potential of leveraging FMs to enhance federated systems. We then present the key motivations behind realizing PFI and explore promising opportunities in this space, including efficient PFI, trustworthy PFI, and PFI empowered by retrieval-augmented generation (RAG). Finally, we outline key challenges and future research directions for deploying FM-powered FL systems at the edge with improved personalization, computational efficiency, and privacy guarantees. Overall, this survey aims to lay the groundwork for the development of API as a complement to AGI, with a particular focus on PFI as a key enabling technique.

摘要

以ChatGPT、DeepSeek和Grok-3为代表的大型语言模型(LLMs)的兴起重塑了人工智能领域格局。作为基于LLMs构建的基础模型(FMs)的典型范例,这些模型展现出生成类人内容的卓越能力,使我们更接近实现通用人工智能(AGI)。然而,其大规模特性、对隐私问题的敏感性及高昂计算需求,为终端用户的个性化定制带来了重大挑战。为弥合这一鸿沟,本文提出"人工个性化智能"(API)的愿景,聚焦于适配这些强大模型以满足用户特定需求与偏好,同时保障隐私与效率。具体而言,本文提出"个性化联邦智能"(PFI),通过融合联邦学习(FL)的隐私保护优势与FMs的零样本泛化能力,实现边缘侧个性化、高效且隐私安全的部署。我们首先综述FL与FMs领域的最新进展,探讨利用FMs增强联邦系统的潜力;继而阐述实现PFI的核心动因,并挖掘该领域的潜在机遇,包括高效PFI、可信PFI以及检索增强生成(RAG)赋能的PFI。最后,我们系统梳理了在边缘端部署FM驱动的FL系统时面临的关键挑战与未来研究方向,涉及个性化提升、计算效率优化和隐私保障强化。本综述旨在为API作为AGI补充形态的发展奠定理论基础,其中PFI将作为核心使能技术获得重点探讨。


From Knowledge to Reasoning: Evaluating LLMs for Ionic Liquids Research in Chemical and Biological Engineering

Abstract

arXiv:2505.06964v1 Announce Type: new Abstract: Although Large Language Models (LLMs) have achieved remarkable performance in diverse general knowledge and reasoning tasks, their utility in the scientific domain of Chemical and Biological Engineering (CBE) is unclear. Hence, it necessitates challenging evaluation benchmarks that can measure LLM performance in knowledge- and reasoning-based tasks, which is lacking. As a foundational step, we empirically measure the reasoning capabilities of LLMs in CBE. We construct and share an expert-curated dataset of 5,920 examples for benchmarking LLMs' reasoning capabilities in the niche domain of Ionic Liquids (ILs) for carbon sequestration, an emergent solution to reducing global warming. The dataset presents different difficulty levels by varying along the dimensions of linguistic and domain-specific knowledge. Benchmarking three less than 10B parameter open-source LLMs on the dataset suggests that while smaller general-purpose LLMs are knowledgeable about ILs, they lack domain-specific reasoning capabilities. Based on our results, we further discuss considerations for leveraging LLMs for carbon capture research using ILs. Since LLMs have a high carbon footprint, gearing them for IL research can symbiotically benefit both fields and help reach the ambitious carbon neutrality target by 2050. Dataset link: https://github.com/sougata-ub/llms_for_ionic_liquids

摘要

尽管大型语言模型(LLMs)在多样化的通用知识与推理任务中表现出卓越性能,但其在化学与生物工程(CBE)科学领域的适用性尚不明确。当前亟需能够评估LLMs在基于知识与推理任务中表现的挑战性基准,而此类基准目前仍属空白。作为基础性工作,我们通过实证方法测量了LLMs在CBE领域的推理能力。我们构建并共享了一个由专家精心策划的数据集,包含5,920个样本,用于评估LLMs在离子液体(ILs)这一碳封存新兴解决方案细分领域的推理能力——该技术对缓解全球变暖具有重要意义。该数据集通过语言复杂度与领域专业知识两个维度的差异化设计,呈现不同难度层级。对三个参数量小于100亿的开源LLMs的基准测试表明:虽然通用型小规模LLMs具备离子液体的基础知识,但缺乏领域特异性推理能力。基于实验结果,我们进一步探讨了利用LLMs开展离子液体碳捕集研究的注意事项。鉴于LLMs本身具有高碳足迹特性,将其应用于离子液体研究可形成协同效应,助力2050年碳中和宏伟目标的实现。数据集链接:https://github.com/sougata-ub/llms_for_ionic_liquids


Architectural Precedents for General Agents using Large Language Models

Abstract

arXiv:2505.07087v1 Announce Type: new Abstract: One goal of AI (and AGI) is to identify and understand specific mechanisms and representations sufficient for general intelligence. Often, this work manifests in research focused on architectures and many cognitive architectures have been explored in AI/AGI. However, different research groups and even different research traditions have somewhat independently identified similar/common patterns of processes and representations or cognitive design patterns that are manifest in existing architectures. Today, AI systems exploiting large language models (LLMs) offer a relatively new combination of mechanism and representation available for exploring the possibilities of general intelligence. In this paper, we summarize a few recurring cognitive design patterns that have appeared in various pre-transformer AI architectures. We then explore how these patterns are evident in systems using LLMs, especially for reasoning and interactive ("agentic") use cases. By examining and applying these recurring patterns, we can also predict gaps or deficiencies in today's Agentic LLM Systems and identify likely subjects of future research towards general intelligence using LLMs and other generative foundation models.

摘要

人工智能(AGI)的目標之一在於識別並理解實現通用智能所需的特定機制與表徵體系。這類研究通常體現為對架構的探索,人工智能領域已湧現出多種認知架構。然而,不同研究團隊乃至不同學術傳統相對獨立地發現了存在於現有架構中的相似/共性處理模式與表徵方式——即認知設計模式。當前,基於大語言模型(LLM)的AI系統為探索通用智能可能性提供了機制與表徵的新穎組合。本文首先歸納了前Transformer時代各類AI架構中反覆出現的認知設計模式,進而探討這些模式如何體現在LLM系統中——特別是在推理與交互式("智能體")應用場景。通過分析與應用這些重現模式,我們既能預測當代智能體LLM系統的缺陷與不足,也能為未來基於LLM及其他生成式基礎模型實現通用智能的研究指明可能方向。


Internet of Agents: Fundamentals, Applications, and Challenges

Abstract

arXiv:2505.07176v1 Announce Type: new Abstract: With the rapid proliferation of large language models and vision-language models, AI agents have evolved from isolated, task-specific systems into autonomous, interactive entities capable of perceiving, reasoning, and acting without human intervention. As these agents proliferate across virtual and physical environments, from virtual assistants to embodied robots, the need for a unified, agent-centric infrastructure becomes paramount. In this survey, we introduce the Internet of Agents (IoA) as a foundational framework that enables seamless interconnection, dynamic discovery, and collaborative orchestration among heterogeneous agents at scale. We begin by presenting a general IoA architecture, highlighting its hierarchical organization, distinguishing features relative to the traditional Internet, and emerging applications. Next, we analyze the key operational enablers of IoA, including capability notification and discovery, adaptive communication protocols, dynamic task matching, consensus and conflict-resolution mechanisms, and incentive models. Finally, we identify open research directions toward building resilient and trustworthy IoA ecosystems.

摘要

随着大型语言模型和视觉语言模型的快速普及,AI智能体已从孤立的任务专用系统演变为能够自主感知、推理和执行而无需人工干预的交互实体。当这些智能体在虚拟和物理环境中广泛部署——从虚拟助手到具身机器人时,建立以智能体为中心的标准化基础设施变得至关重要。本综述提出'智能体互联网'(IoA)作为基础框架,旨在实现大规模异构智能体间的无缝互联、动态发现与协同编排。我们首先阐述通用IoA体系架构,重点分析其层次化组织结构、相较于传统互联网的差异化特征以及新兴应用场景;继而剖析IoA运行的关键使能技术,包括能力通告与发现机制、自适应通信协议、动态任务匹配、共识与冲突解决机制以及激励模型;最后指出构建高鲁棒性、可信赖IoA生态系统的开放研究方向。


DialogueReason: Rule-Based RL Sparks Dialogue Reasoning in LLMs

Abstract

arXiv:2505.07049v1 Announce Type: new Abstract: We propose DialogueReason, a reasoning paradigm that uncovers the lost roles in monologue-style reasoning models, aiming to boost diversity and coherency of the reasoning process. Recent advances in RL-based large reasoning models have led to impressive long CoT capabilities and high performance on math and science benchmarks. However, these reasoning models rely mainly on monologue-style reasoning, which often limits reasoning diversity and coherency, frequently recycling fixed strategies or exhibiting unnecessary shifts in attention. Our work consists of an analysis of monologue reasoning patterns and the development of a dialogue-based reasoning approach. We first introduce the Compound-QA task, which concatenates multiple problems into a single prompt to assess both diversity and coherency of reasoning. Our analysis shows that Compound-QA exposes weaknesses in monologue reasoning, evidenced by both quantitative metrics and qualitative reasoning traces. Building on the analysis, we propose a dialogue-based reasoning, named DialogueReason, structured around agents, environment, and interactions. Using PPO with rule-based rewards, we train open-source LLMs (Qwen-QWQ and Qwen-Base) to adopt dialogue reasoning. We evaluate trained models on MATH, AIME, and GPQA datasets, showing that the dialogue reasoning model outperforms monologue models under more complex compound questions. Additionally, we discuss how dialogue-based reasoning helps enhance interpretability, facilitate more intuitive human interaction, and inspire advances in multi-agent system design.

摘要

我们提出对话式推理范式DialogueReason,旨在通过恢复独白式推理模型中缺失的角色互动来增强推理过程的多样性与连贯性。尽管基于强化学习的大型推理模型近期在长链推理能力和数理科学基准测试中表现优异,但这些模型主要依赖独白式推理,往往导致推理策略单一化或注意力无谓转移等问题。本研究包含对独白推理模式的分析和对话式推理方法的开发:首先设计复合问答任务(Compound-QA),通过多问题串联提示评估推理的多样性与连贯性。分析表明,该任务从量化指标和定性推理轨迹两方面揭示了独白推理的缺陷。基于此,我们构建了包含智能体、环境与交互要素的对话式推理框架DialogueReason,采用基于规则的PPO奖励机制训练开源大语言模型(Qwen-QWQ与Qwen-Base)。在MATH、AIME和GPQA数据集上的实验表明,面对复杂复合问题时,对话推理模型性能显著优于独白模型。此外,我们探讨了对话式推理如何提升模型可解释性、优化人机交互直觉性,并为多智能体系统设计提供新思路。


RefPentester: A Knowledge-Informed Self-Reflective Penetration Testing Framework Based on Large Language Models

Abstract

arXiv:2505.07089v1 Announce Type: new Abstract: Automated penetration testing (AutoPT) powered by large language models (LLMs) has gained attention for its ability to automate ethical hacking processes and identify vulnerabilities in target systems by leveraging the intrinsic knowledge of LLMs. However, existing LLM-based AutoPT frameworks often underperform compared to human experts in challenging tasks for several reasons: the imbalanced knowledge used in LLM training, short-sighted planning in the planning process, and hallucinations during command generation. In addition, the penetration testing (PT) process, with its trial-and-error nature, is limited by existing frameworks that lack mechanisms to learn from previous failed operations, restricting adaptive improvement of PT strategies. To address these limitations, we propose a knowledge-informed self-reflective PT framework powered by LLMs, called RefPentester, which is an AutoPT framework designed to assist human operators in identifying the current stage of the PT process, selecting appropriate tactic and technique for the stage, choosing suggested action, providing step-by-step operational guidance, and learning from previous failed operations. We also modeled the PT process as a seven-state Stage Machine to integrate the proposed framework effectively. The evaluation shows that RefPentester can successfully reveal credentials on Hack The Box's Sau machine, outperforming the baseline GPT-4o model by 16.7%. Across PT stages, RefPentester also demonstrates superior success rates on PT stage transitions.

摘要

基于大语言模型(LLM)的自动化渗透测试(AutoPT)因其能自动化执行道德黑客流程,并利用LLM的固有知识识别目标系统漏洞而受到关注。然而,现有基于LLM的AutoPT框架在复杂任务中表现常逊于人类专家,原因包括:LLM训练中使用的知识分布不均衡、规划过程中缺乏远见,以及命令生成时的幻觉问题。此外,渗透测试(PT)过程具有试错特性,但现有框架缺乏从先前失败操作中学习的机制,限制了PT策略的自适应改进。为解决这些局限性,我们提出了一种基于LLM的知识引导自反思PT框架RefPentester。该AutoPT框架能协助操作人员完成以下功能:识别PT流程当前阶段、为各阶段选择合适战术与技术、筛选建议操作、提供分步执行指导,并从历史失败操作中学习。我们还将PT过程建模为七状态阶段机,以有效集成该框架。评估表明,RefPentester能成功获取Hack The Box平台Sau机器的凭证,性能较基线GPT-4o模型提升16.7%。在各PT阶段中,RefPentester在阶段转换成功率方面也展现出显著优势。


Measuring General Intelligence with Generated Games

Abstract

arXiv:2505.07215v1 Announce Type: new Abstract: We present gg-bench, a collection of game environments designed to evaluate general reasoning capabilities in language models. Unlike most static benchmarks, gg-bench is a data generating process where new evaluation instances can be generated at will. In particular, gg-bench is synthetically generated by (1) using a large language model (LLM) to generate natural language descriptions of novel games, (2) using the LLM to implement each game in code as a Gym environment, and (3) training reinforcement learning (RL) agents via self-play on the generated games. We evaluate language models by their winrate against these RL agents by prompting models with the game description, current board state, and a list of valid moves, after which models output the moves they wish to take. gg-bench is challenging: state-of-the-art LLMs such as GPT-4o and Claude 3.7 Sonnet achieve winrates of 7-9% on gg-bench using in-context learning, while reasoning models such as o1, o3-mini and DeepSeek-R1 achieve average winrates of 31-36%. We release the generated games, data generation process, and evaluation code in order to support future modeling work and expansion of our benchmark.

摘要

我们提出gg-bench——一个用于评估语言模型通用推理能力的游戏环境集合。与大多数静态基准测试不同,gg-bench是一个可随时生成新评估实例的数据生成系统。该基准通过以下步骤合成生成:(1)使用大语言模型(LLM)生成新颖游戏的自然语言描述,(2)利用LLM将每个游戏编码实现为Gym环境,(3)通过自我对弈在生成游戏上训练强化学习(RL)智能体。我们通过语言模型与这些RL智能体的胜率对其进行评估:向模型提供游戏描述、当前棋盘状态和有效移动列表后,模型输出其选择的移动操作。gg-bench具有显著挑战性:采用上下文学习时,最先进的LLM(如GPT-4o和Claude 3.7 Sonnet)仅获得7-9%的胜率,而推理专用模型(如o1、o3-mini和DeepSeek-R1)平均胜率达到31-36%。我们公开了生成的游戏、数据生成流程和评估代码,以支持后续建模研究及基准扩展工作。


PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications

Abstract

arXiv:2505.07203v1 Announce Type: new Abstract: Besides typical generative applications, like ChatGPT, GitHub Copilot, and Cursor, we observe an emerging trend that LLMs are increasingly used in traditional discriminative tasks, such as recommendation, credit verification, and data labeling. The key characteristic of these emerging use cases is that the LLM generates only a single output token, rather than an arbitrarily long sequence of tokens. We call this prefill-only workload. However, since existing LLM engines assume arbitrary output lengths, they fail to leverage the unique properties of prefill-only workloads. In this paper, we present PrefillOnly, the first LLM inference engine that improves the inference throughput and latency by fully embracing the properties of prefill-only workloads. First, since it generates only one token, PrefillOnly only needs to store the KV cache of only the last computed layer, rather than of all layers. This drastically reduces the GPU memory footprint of LLM inference and allows handling long inputs without using solutions that reduces throughput, such as cross-GPU KV cache parallelization. Second, because the output length is fixed, rather than arbitrary, PrefillOnly can precisely determine the job completion time (JCT) of each prefill-only request before it starts. This enables efficient JCT-aware scheduling policies such as shortest remaining job first. PrefillOnly can process upto 4x larger queries per second without inflating average and P99 latency.

摘要

除ChatGPT、GitHub Copilot和Cursor等典型生成式应用外,我们观察到大型语言模型(LLM)正日益应用于传统判别式任务(如推荐、信用验证和数据标注)的新趋势。这些新兴用例的关键特征是LLM仅生成单个输出标记,而非任意长度的标记序列。我们将此类工作负载称为预填充专属型。然而,由于现有LLM引擎假设输出长度可变,它们未能利用预填充专属型工作负载的独有特性。本文提出PrefillOnly——首个通过充分适配预填充专属型工作负载特性来提升推理吞吐量与延迟的LLM推理引擎。首先,由于仅需生成单标记,PrefillOnly只需存储最后一计算层的KV缓存,而非全层缓存。这显著降低了LLM推理的GPU内存占用,并能在不采用跨GPU KV缓存并行化等降低吞吐量的解决方案前提下处理长输入。其次,因输出长度固定而非可变,PrefillOnly可在请求开始前精确判定每个预填充专属型请求的作业完成时间(JCT),从而实现最短剩余作业优先等高效的JCT感知调度策略。PrefillWithout每秒可处理的查询量提升达4倍,且不会增加平均延迟与P99延迟。


Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks

Abstract

arXiv:2505.07473v1 Announce Type: new Abstract: The application of large language models (LLMs) in the field of coding is evolving rapidly: from code assistants, to autonomous coding agents, and then to generating complete projects through natural language. Early LLM code benchmarks primarily focused on code generation accuracy, but these benchmarks have gradually become saturated. Benchmark saturation weakens their guiding role for LLMs. For example, HumanEval Pass@1 has reached 99.4% and MBPP 94.2%. Among various attempts to address benchmark saturation, approaches based on software engineering have stood out, but the saturation of existing software engineering benchmarks is rapidly increasing. To address this, we propose a new benchmark, Web-Bench, which contains 50 projects, each consisting of 20 tasks with sequential dependencies. The tasks implement project features in sequence, simulating real-world human development workflows. When designing Web-Bench, we aim to cover the foundational elements of Web development: Web Standards and Web Frameworks. Given the scale and complexity of these projects, which were designed by engineers with 5 to 10 years of experience, each presents a significant challenge. On average, a single project takes 4 to 8 hours for a senior engineer to complete. On our given benchmark agent (Web-Agent), SOTA (Claude 3.7 Sonnet) achieves only 25.1% Pass@1, significantly lower (better) than SWE-Bench's Verified (65.4%) and Full (33.8%) scores. Finally, we discuss that in any development field, Standards and Frameworks represent foundational knowledge and efficiency tools, respectively, and LLMs require optimization tailored to them.


RAI: Flexible Agent Framework for Embodied AI

Abstract

arXiv:2505.07532v1 Announce Type: new Abstract: With an increase in the capabilities of generative language models, a growing interest in embodied AI has followed. This contribution introduces RAI - a framework for creating embodied Multi Agent Systems for robotics. The proposed framework implements tools for Agents' integration with robotic stacks, Large Language Models, and simulations. It provides out-of-the-box integration with state-of-the-art systems like ROS 2. It also comes with dedicated mechanisms for the embodiment of Agents. These mechanisms have been tested on a physical robot, Husarion ROSBot XL, which was coupled with its digital twin, for rapid prototyping. Furthermore, these mechanisms have been deployed in two simulations: (1) robot arm manipulator and (2) tractor controller. All of these deployments have been evaluated in terms of their control capabilities, effectiveness of embodiment, and perception ability. The proposed framework has been used successfully to build systems with multiple agents. It has demonstrated effectiveness in all the aforementioned tasks. It also enabled identifying and addressing the shortcomings of the generative models used for embodied AI.

摘要

随着生成式语言模型能力的提升,人们对具身人工智能的关注度日益增长。本研究提出RAI框架——一个用于构建机器人具身多智能体系统的开发框架。该框架实现了智能体与机器人技术栈、大语言模型及仿真环境的集成工具,提供与ROS 2等前沿系统的开箱即用集成,并包含专门的智能体具身化机制。这些机制已在Husarion ROSBot XL物理机器人及其数字孪生体上完成测试,用于快速原型开发。此外,这些机制还部署于两种仿真环境:(1)机械臂操控系统 (2)拖拉机控制系统。所有部署案例均从控制能力、具身化效果和感知能力三个维度进行了评估。实验表明,该框架能有效构建多智能体系统,在上述所有任务中均表现出良好性能,同时有助于发现并改进用于具身AI的生成模型存在的缺陷。


A Survey on Collaborative Mechanisms Between Large and Small Language Models

Abstract

arXiv:2505.07460v1 Announce Type: new Abstract: Large Language Models (LLMs) deliver powerful AI capabilities but face deployment challenges due to high resource costs and latency, whereas Small Language Models (SLMs) offer efficiency and deployability at the cost of reduced performance. Collaboration between LLMs and SLMs emerges as a crucial paradigm to synergistically balance these trade-offs, enabling advanced AI applications, especially on resource-constrained edge devices. This survey provides a comprehensive overview of LLM-SLM collaboration, detailing various interaction mechanisms (pipeline, routing, auxiliary, distillation, fusion), key enabling technologies, and diverse application scenarios driven by on-device needs like low latency, privacy, personalization, and offline operation. While highlighting the significant potential for creating more efficient, adaptable, and accessible AI, we also discuss persistent challenges including system overhead, inter-model consistency, robust task allocation, evaluation complexity, and security/privacy concerns. Future directions point towards more intelligent adaptive frameworks, deeper model fusion, and expansion into multimodal and embodied AI, positioning LLM-SLM collaboration as a key driver for the next generation of practical and ubiquitous artificial intelligence.

摘要

大语言模型(LLMs)虽能提供强大的人工智能能力,却因高资源成本与延迟问题面临部署挑战;而小语言模型(SLMs)虽具备高效性和易部署优势,但性能有所降低。LLM与SLM的协作成为协同平衡这些权衡的关键范式,可推动先进AI应用在资源受限的边缘设备上落地。本综述全面阐述了LLM-SLM协作框架,系统梳理了多种交互机制(流水线、路由、辅助、蒸馏、融合)、关键使能技术,以及由低延迟、隐私保护、个性化、离线操作等设备端需求驱动的多样化应用场景。在强调该范式对构建更高效、适应性强且普惠化AI显著潜力的同时,我们也探讨了持续存在的挑战,包括系统开销、模型间一致性、鲁棒任务分配、评估复杂性及安全/隐私问题。未来研究方向指向更智能的自适应框架、更深度的模型融合,以及向多模态与具身AI的拓展,这些趋势将推动LLM-SLM协作成为下一代实用化、普适化人工智能的核心驱动力。


QuantX: A Framework for Hardware-Aware Quantization of Generative AI Workloads

Abstract

arXiv:2505.07531v1 Announce Type: new Abstract: We present QuantX: a tailored suite of recipes for LLM and VLM quantization. It is capable of quantizing down to 3-bit resolutions with minimal loss in performance. The quantization strategies in QuantX take into account hardware-specific constraints to achieve efficient dequantization during inference ensuring flexible trade-off between runtime speed, memory requirement and model accuracy. Our results demonstrate that QuantX achieves performance within 6% of the unquantized model for LlaVa-v1.6 quantized down to 3-bits for multiple end user tasks and outperforms recently published state-of-the-art quantization techniques. This manuscript provides insights into the LLM quantization process that motivated the range of recipes and options that are incorporated in QuantX.

摘要

我们提出QuantX:一套专为LLM和VLM量化定制的方案集。该方案能够将模型量化至3比特分辨率且性能损失极小。QuantX中的量化策略考虑了硬件特定约束,以确保推理过程中实现高效反量化,从而在运行速度、内存需求和模型精度之间实现灵活权衡。实验结果表明,在将LlaVa-v1.6量化为3比特后,QuantX在多项终端用户任务中性能损失不超过未量化模型的6%,且优于近期发表的最先进量化技术。本文深入剖析了LLM量化过程中启发QuantX方案设计的技术思路,阐释了集成在该工具中的多种量化策略与选项。


How well do LLMs reason over tabular data, really?

Abstract

arXiv:2505.07453v1 Announce Type: new Abstract: Large Language Models (LLMs) excel in natural language tasks, but less is known about their reasoning capabilities over tabular data. Prior analyses devise evaluation strategies that poorly reflect an LLM's realistic performance on tabular queries. Moreover, we have a limited understanding of the robustness of LLMs towards realistic variations in tabular inputs. Therefore, we ask: Can general-purpose LLMs reason over tabular data, really?, and focus on two questions 1) are tabular reasoning capabilities of general-purpose LLMs robust to real-world characteristics of tabular inputs, and 2) how can we realistically evaluate an LLM's performance on analytical tabular queries? Building on a recent tabular reasoning benchmark, we first surface shortcomings of its multiple-choice prompt evaluation strategy, as well as commonly used free-form text metrics such as SacreBleu and BERT-score. We show that an LLM-as-a-judge procedure yields more reliable performance insights and unveil a significant deficit in tabular reasoning performance of LLMs. We then extend the tabular inputs reflecting three common characteristics in practice: 1) missing values, 2) duplicate entities, and 3) structural variations. Experiments show that the tabular reasoning capabilities of general-purpose LLMs suffer from these variations, stressing the importance of improving their robustness for realistic tabular inputs.

摘要

大语言模型(LLMs)在自然语言任务中表现出色,但其对表格数据的推理能力尚不明确。现有分析采用的评估策略难以真实反映LLMs在表格查询中的实际表现。此外,我们对于LLMs在表格输入现实变化中的鲁棒性理解有限。因此,我们提出核心问题:通用LLMs能否真正实现表格数据推理?并聚焦两个具体问题:1)通用LLMs的表格推理能力是否对现实表格输入特征具有鲁棒性;2)如何真实评估LLMs在分析性表格查询中的性能?基于近期表格推理基准,我们首先揭示其多项选择提示评估策略的缺陷,以及SacreBleu、BERT-score等常用自由文本指标的局限性。研究表明,采用LLM-as-a-judge(LLM作为评判者)方法能获得更可靠的性能评估,并暴露出LLMs在表格推理性能上的显著不足。随后,我们扩展表格输入以反映三种常见现实特征:1)缺失值,2)重复实体,3)结构变异。实验表明,通用LLMs的表格推理能力受这些变化影响显著,这凸显了提升其对现实表格输入鲁棒性的重要性。


YuLan-OneSim: Towards the Next Generation of Social Simulator with Large Language Models

Abstract

arXiv:2505.07581v1 Announce Type: new Abstract: Leveraging large language model (LLM) based agents to simulate human social behaviors has recently gained significant attention. In this paper, we introduce a novel social simulator called YuLan-OneSim. Compared to previous works, YuLan-OneSim distinguishes itself in five key aspects: (1) Code-free scenario construction: Users can simply describe and refine their simulation scenarios through natural language interactions with our simulator. All simulation code is automatically generated, significantly reducing the need for programming expertise. (2) Comprehensive default scenarios: We implement 50 default simulation scenarios spanning 8 domains, including economics, sociology, politics, psychology, organization, demographics, law, and communication, broadening access for a diverse range of social researchers. (3) Evolvable simulation: Our simulator is capable of receiving external feedback and automatically fine-tuning the backbone LLMs, significantly enhancing the simulation quality. (4) Large-scale simulation: By developing a fully responsive agent framework and a distributed simulation architecture, our simulator can handle up to 100,000 agents, ensuring more stable and reliable simulation results. (5) AI social researcher: Leveraging the above features, we develop an AI social researcher. Users only need to propose a research topic, and the AI researcher will automatically analyze the input, construct simulation environments, summarize results, generate technical reports, review and refine the reports--completing the social science research loop. To demonstrate the advantages of YuLan-OneSim, we conduct experiments to evaluate the quality of the automatically generated scenarios, the reliability, efficiency, and scalability of the simulation process, as well as the performance of the AI social researcher.

摘要

利用基于大语言模型(LLM)的智能体模拟人类社会行为近期受到广泛关注。本文提出新型社交模拟器"玉兰-OneSim",相较已有研究具备五大创新点:(1) 无代码场景构建:用户仅需通过自然语言交互描述并优化模拟场景,所有模拟代码自动生成,大幅降低编程技能需求;(2) 完备默认场景库:实现涵盖经济、社会学、政治、心理学、组织、人口统计、法律与传播8大领域的50个默认场景,拓宽跨学科社会科学研究者的使用边界;(3) 可进化模拟:支持接收外部反馈并自动微调底层LLM,显著提升模拟质量;(4) 大规模模拟:通过开发全响应式智能体框架与分布式仿真架构,可支持10万级智能体规模,确保结果更稳定可靠;(5) AI社会研究员:基于上述功能开发人工智能研究助手,用户仅需提出研究主题,系统即可自动完成分析输入、构建环境、结果汇总、生成技术报告及迭代优化等全流程社会科学研究闭环。为验证玉兰-OneSim优势,我们通过实验评估了自动生成场景质量、模拟过程的可靠性、效率与可扩展性,以及AI社会学者的综合表现。


S-GRPO: Early Exit via Reinforcement Learning in Reasoning Models

Abstract

arXiv:2505.07686v1 Announce Type: new Abstract: As Test-Time Scaling emerges as an active research focus in the large language model community, advanced post-training methods increasingly emphasize extending chain-of-thought (CoT) generation length, thereby enhancing reasoning capabilities to approach Deepseek R1-like reasoning models. However, recent studies reveal that reasoning models (even Qwen3) consistently exhibit excessive thought redundancy in CoT generation. This overthinking problem stems from conventional outcome-reward reinforcement learning's systematic neglect in regulating intermediate reasoning steps. This paper proposes Serial-Group Decaying-Reward Policy Optimization (namely S-GRPO), a novel reinforcement learning method that empowers models with the capability to determine the sufficiency of reasoning steps, subsequently triggering early exit of CoT generation. Specifically, unlike GRPO, which samples multiple possible completions (parallel group) in parallel, we select multiple temporal positions in the generation of one CoT to allow the model to exit thinking and instead generate answers (serial group), respectively. For the correct answers in a serial group, we assign rewards that decay according to positions, with lower rewards towards the later ones, thereby reinforcing the model's behavior to generate higher-quality answers at earlier phases with earlier exits of thinking. Empirical evaluations demonstrate compatibility with state-of-the-art reasoning models, including Qwen3 and Deepseek-distill models, achieving 35.4% ~ 61.1% sequence length reduction with 0.72% ~ 6.08% accuracy improvements across GSM8K, AIME 2024, AMC 2023, MATH-500, and GPQA Diamond benchmarks.

摘要

随着测试时缩放成为大语言模型领域的研究热点,先进的后训练方法日益注重扩展思维链生成长度,从而提升推理能力以接近Deepseek R1类推理模型。然而最新研究表明,推理模型(包括Qwen3)在思维链生成中普遍存在过度思考冗余现象。该问题源于传统结果奖励型强化学习对中间推理步骤调控的系统性忽视。本文提出序列化分组衰减奖励策略优化方法(简称S-GRPO),通过新型强化学习赋予模型自主判断推理充分性并触发思维链提前退出的能力。具体而言,与GRPO并行采样多组可能补全(并行分组)不同,我们在单条思维链生成过程中选取多个时序位置(序列分组),分别允许模型退出思考并输出答案。对于序列分组中的正确答案,我们按位置实施衰减奖励分配,后期答案奖励递减,从而强化模型在更早阶段生成高质量答案并提前终止思考的行为。实证评估表明,本方法与Qwen3、Deepseek-distill等前沿推理模型兼容,在GSM8K、AIME 2024、AMC 2023、MATH-500和GPQA Diamond基准测试中实现35.4%~61.1%的序列长度缩减,同时获得0.72%~6.08%的准确率提升。


Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving

Abstract

arXiv:2505.07773v1 Announce Type: new Abstract: Large Language Models (LLMs) often struggle with mathematical reasoning tasks requiring precise, verifiable computation. While Reinforcement Learning (RL) from outcome-based rewards enhances text-based reasoning, understanding how agents autonomously learn to leverage external tools like code execution remains crucial. We investigate RL from outcome-based rewards for Tool-Integrated Reasoning, ZeroTIR, training base LLMs to spontaneously generate and execute Python code for mathematical problems without supervised tool-use examples. Our central contribution is we demonstrate that as RL training progresses, key metrics scale predictably. Specifically, we observe strong positive correlations where increased training steps lead to increases in the spontaneous code execution frequency, the average response length, and, critically, the final task accuracy. This suggests a quantifiable relationship between computational effort invested in training and the emergence of effective, tool-augmented reasoning strategies. We implement a robust framework featuring a decoupled code execution environment and validate our findings across standard RL algorithms and frameworks. Experiments show ZeroTIR significantly surpasses non-tool ZeroRL baselines on challenging math benchmarks. Our findings provide a foundational understanding of how autonomous tool use is acquired and scales within Agent RL, offering a reproducible benchmark for future studies. Code is released at \href{https://github.com/Anonymize-Author/AgentRL}{https://github.com/Anonymize-Author/AgentRL}.

摘要

大型语言模型(LLMs)在处理需要精确可验证计算的数学推理任务时常常表现不佳。虽然基于结果奖励的强化学习(RL)能提升文本推理能力,但理解智能体如何自主学习利用代码执行等外部工具仍至关重要。我们研究了基于结果奖励的强化学习在工具集成推理(ZeroTIR)中的应用,该方法通过训练基础LLM模型,使其无需监督工具使用示例即可自主生成并执行Python代码来解决数学问题。我们的核心贡献在于证明随着RL训练的推进,关键指标呈现可预测的规律性变化:训练步数增加会显著提高自主代码执行频率、平均响应长度以及最终任务准确率,这些指标间存在强正相关性。这表明训练投入的计算资源与有效工具增强推理策略的涌现之间存在可量化的关联。我们实现了一个包含解耦代码执行环境的鲁棒框架,并在标准RL算法和框架中验证了发现。实验表明,在复杂数学基准测试上,ZeroTIR显著优于非工具使用的ZeroRL基线。本研究为理解自主工具使用在智能体强化学习中的获取与扩展机制提供了基础性认知,并为未来研究提供了可复现的基准。代码发布于https://github.com/Anonymize-Author/AgentRL。


Dialz: A Python Toolkit for Steering Vectors

Abstract

arXiv:2505.06262v1 Announce Type: cross Abstract: We introduce Dialz, a framework for advancing research on steering vectors for open-source LLMs, implemented in Python. Steering vectors allow users to modify activations at inference time to amplify or weaken a 'concept', e.g. honesty or positivity, providing a more powerful alternative to prompting or fine-tuning. Dialz supports a diverse set of tasks, including creating contrastive pair datasets, computing and applying steering vectors, and visualizations. Unlike existing libraries, Dialz emphasizes modularity and usability, enabling both rapid prototyping and in-depth analysis. We demonstrate how Dialz can be used to reduce harmful outputs such as stereotypes, while also providing insights into model behaviour across different layers. We release Dialz with full documentation, tutorials, and support for popular open-source models to encourage further research in safe and controllable language generation. Dialz enables faster research cycles and facilitates insights into model interpretability, paving the way for safer, more transparent, and more reliable AI systems.

摘要

我们推出Dialz框架——一个基于Python实现、用于推进开源大语言模型导向向量研究的平台。导向向量技术允许用户在推理阶段通过修改激活值来增强或抑制特定"概念"(如诚实性或积极性),相比提示工程或微调方法具有更强大的调控能力。该框架支持创建对比配对数据集、计算与应用导向向量以及可视化分析等多样化任务。与现有工具库不同,Dialz强调模块化设计与可用性,既能支持快速原型开发,也能满足深度分析需求。我们展示了如何利用Dialz减少模型输出中的有害内容(如刻板印象),同时揭示模型在不同网络层的行为特征。本框架提供完整文档、教程及主流开源模型支持,以促进安全可控语言生成领域的深入研究。Dialz通过加速研究周期和增强模型可解释性分析,为构建更安全、透明、可靠的人工智能系统开辟了新途径。


AKD : Adversarial Knowledge Distillation For Large Language Models Alignment on Coding tasks

Abstract

arXiv:2505.06267v1 Announce Type: cross Abstract: The widespread adoption of Large Language Models (LLMs) for code generation, exemplified by GitHub Copilot\footnote{A coding extension powered by a Code-LLM to assist in code completion tasks} surpassing a million users, highlights the transformative potential of these tools in improving developer productivity. However, this rapid growth also underscores critical concerns regarding the quality, safety, and reliability of the code they generate. As Code-LLMs evolve, they face significant challenges, including the diminishing returns of model scaling and the scarcity of new, high-quality training data. To address these issues, this paper introduces Adversarial Knowledge Distillation (AKD), a novel approach that leverages adversarially generated synthetic datasets to distill the capabilities of larger models into smaller, more efficient ones. By systematically stress-testing and refining the reasoning capabilities of Code-LLMs, AKD provides a framework for enhancing model robustness, reliability, and security while improving their parameter-efficiency. We believe this work represents a critical step toward ensuring dependable automated code generation within the constraints of existing data and the cost-efficiency of model execution.

摘要

以GitHub Copilot(注:一种由代码大语言模型驱动的编程扩展工具,用于辅助代码补全任务)用户数突破百万为标志,大语言模型在代码生成领域的广泛应用彰显了这些工具在提升开发者生产力方面的变革潜力。然而,这种快速增长也凸显出关于生成代码质量、安全性和可靠性的关键问题。随着代码大语言模型的发展,它们面临着模型扩展收益递减和高质量新训练数据稀缺等重大挑战。为解决这些问题,本文提出对抗性知识蒸馏(AKD),该方法利用对抗生成的合成数据集,将大型模型的能力蒸馏至更小型高效的模型中。通过系统化的压力测试与模型推理能力优化,AKD为提升代码大语言模型的鲁棒性、可靠性和安全性提供了框架,同时改善了参数效率。我们相信这项研究代表了在现有数据限制和模型运行成本效益条件下,实现可靠自动化代码生成的关键一步。


User Behavior Analysis in Privacy Protection with Large Language Models: A Study on Privacy Preferences with Limited Data

Abstract

arXiv:2505.06305v1 Announce Type: cross Abstract: With the widespread application of large language models (LLMs), user privacy protection has become a significant research topic. Existing privacy preference modeling methods often rely on large-scale user data, making effective privacy preference analysis challenging in data-limited environments. This study explores how LLMs can analyze user behavior related to privacy protection in scenarios with limited data and proposes a method that integrates Few-shot Learning and Privacy Computing to model user privacy preferences. The research utilizes anonymized user privacy settings data, survey responses, and simulated data, comparing the performance of traditional modeling approaches with LLM-based methods. Experimental results demonstrate that, even with limited data, LLMs significantly improve the accuracy of privacy preference modeling. Additionally, incorporating Differential Privacy and Federated Learning further reduces the risk of user data exposure. The findings provide new insights into the application of LLMs in privacy protection and offer theoretical support for advancing privacy computing and user behavior analysis.

摘要

随着大语言模型(LLMs)的广泛应用,用户隐私保护已成为重要研究课题。现有隐私偏好建模方法通常依赖大规模用户数据,在数据有限环境下难以有效进行隐私偏好分析。本研究探索LLMs如何在数据受限场景下分析用户隐私保护行为,提出融合小样本学习与隐私计算的方法来建模用户隐私偏好。研究采用匿名化用户隐私设置数据、调查问卷反馈和模拟数据,比较传统建模方法与基于LLM方法的性能。实验结果表明,即使在有限数据条件下,LLMs也能显著提升隐私偏好建模的准确性。此外,结合差分隐私和联邦学习技术可进一步降低用户数据暴露风险。这些发现为LLMs在隐私保护领域的应用提供了新思路,并为推进隐私计算和用户行为分析提供了理论支持。


PARM: Multi-Objective Test-Time Alignment via Preference-Aware Autoregressive Reward Model

Abstract

arXiv:2505.06274v1 Announce Type: cross Abstract: Multi-objective test-time alignment aims to adapt large language models (LLMs) to diverse multi-dimensional user preferences during inference while keeping LLMs frozen. Recently, GenARM (Xu et al., 2025) first independently trains Autoregressive Reward Models (ARMs) for each preference dimension without awareness of each other, then combines their outputs based on user-specific preference vectors during inference to achieve multi-objective test-time alignment, leading to two key limitations: the need for \textit{multiple} ARMs increases the inference cost, and the separate training of ARMs causes the misalignment between the guided generation and the user preferences. To address these issues, we propose Preference-aware ARM (PARM), a single unified ARM trained across all preference dimensions. PARM uses our proposed Preference-Aware Bilinear Low-Rank Adaptation (PBLoRA), which employs a bilinear form to condition the ARM on preference vectors, enabling it to achieve precise control over preference trade-offs during inference. Experiments demonstrate that PARM reduces inference costs and achieves better alignment with preference vectors compared with existing methods. Additionally, PARM enables weak-to-strong guidance, allowing a smaller PARM to guide a larger frozen LLM without expensive training, making multi-objective alignment accessible with limited computing resources. The code is available at https://github.com/Baijiong-Lin/PARM.

摘要

多目标测试时对齐旨在保持大型语言模型(LLMs)参数冻结的前提下,使其在推理阶段适应多样化的多维用户偏好。近期提出的GenARM(Xu等人,2025)通过为每个偏好维度独立训练自回归奖励模型(ARMs),并在推理时基于用户特定偏好向量组合其输出以实现多目标对齐,但存在两个关键局限:需要维护多个ARMs导致推理成本增加,且ARMs的独立训练会导致生成结果与用户偏好失配。为解决这些问题,我们提出偏好感知自回归奖励模型(PARM),这是一种跨所有偏好维度联合训练的单一统一模型。PARM采用我们设计的偏好感知双线性低秩适配(PBLoRA),通过双线性形式将偏好向量作为ARM的条件输入,从而在推理时实现精准的偏好权衡控制。实验表明,与现有方法相比,PARM在降低推理成本的同时能更好地对齐用户偏好向量。此外,PARM支持弱到强引导能力,使得较小规模的PARM无需昂贵训练即可引导更大规模的冻结LLM,为计算资源受限场景下的多目标对齐提供了可行方案。代码已开源:https://github.com/Baijiong-Lin/PARM。


A Sensitivity-Driven Expert Allocation Method in LoRA-MoE for Efficient Fine-Tuning

Abstract

arXiv:2505.06272v1 Announce Type: cross Abstract: As deep learning models expand, the pre-training-fine-tuning paradigm has become the standard approach for handling various downstream tasks. However, shared parameters can lead to diminished performance when dealing with complex datasets involving multiple tasks. While introducing Mixture-of-Experts (MoE) methods has alleviated this issue to some extent, it also significantly increases the number of parameters required for fine-tuning and training time, introducing greater parameter redundancy. To address these challenges, we propose a method for allocating expert numbers based on parameter sensitivity LoRA-SMoE (A Sensitivity-Driven Expert Allocation Method in LoRA-MoE for Efficient Fine-Tuning). This method rapidly assesses the sensitivity of different tasks to parameters by sampling a small amount of data and using gradient information. It then adaptively allocates expert numbers within a given budget. The process maintains comparable memory consumption to LoRA (Low-Rank Adaptation) while ensuring an efficient and resource-friendly fine-tuning procedure. Experimental results demonstrate that compared to SOTA fine-tuning methods, our LoRA-SMoE approach can enhance model performance while reducing the number of trainable parameters. This significantly improves model performance in resource-constrained environments. Additionally, due to its efficient parameter sensitivity evaluation mechanism, LoRA-SMoE requires minimal computational overhead to optimize expert allocation, making it particularly suitable for scenarios with limited computational resources. All the code in this study will be made publicly available following the acceptance of the paper for publication. Source code is at https://github.com/EMLS-ICTCAS/LoRA-SMoE

摘要

随着深度学习模型规模的扩大,预训练-微调范式已成为处理各类下游任务的标准方法。然而在面对涉及多任务的复杂数据集时,共享参数会导致性能下降。虽然引入混合专家(MoE)方法在一定程度上缓解了该问题,但也显著增加了微调所需的参数量和训练时间,带来更大的参数冗余。为解决这些挑战,我们提出了一种基于参数敏感性的专家数量分配方法LoRA-SMoE(面向高效微调的LoRA-MoE中基于敏感性的专家分配方法)。该方法通过采样少量数据并利用梯度信息,快速评估不同任务对参数的敏感性,进而在给定预算内自适应分配专家数量。该过程保持与低秩适应(LoRA)相当的内存消耗,同时确保高效且资源友好的微调过程。实验结果表明,相比最先进的微调方法,我们的LoRA-SMoE方法能在减少可训练参数量的同时提升模型性能,显著改善了资源受限环境下的模型表现。此外得益于高效的参数敏感性评估机制,LoRA-SMoE仅需极小计算开销即可优化专家分配,特别适用于计算资源有限的场景。本研究所有代码将在论文录用后公开,源代码详见https://github.com/EMLS-ICTCAS/LoRA-SMoE。


Defending against Indirect Prompt Injection by Instruction Detection

Abstract

arXiv:2505.06311v1 Announce Type: cross Abstract: The integration of Large Language Models (LLMs) with external sources is becoming increasingly common, with Retrieval-Augmented Generation (RAG) being a prominent example. However, this integration introduces vulnerabilities of Indirect Prompt Injection (IPI) attacks, where hidden instructions embedded in external data can manipulate LLMs into executing unintended or harmful actions. We recognize that the success of IPI attacks fundamentally relies in the presence of instructions embedded within external content, which can alter the behavioral state of LLMs. Can effectively detecting such state changes help us defend against IPI attacks? In this paper, we propose a novel approach that takes external data as input and leverages the behavioral state of LLMs during both forward and backward propagation to detect potential IPI attacks. Specifically, we demonstrate that the hidden states and gradients from intermediate layers provide highly discriminative features for instruction detection. By effectively combining these features, our approach achieves a detection accuracy of 99.60% in the in-domain setting and 96.90% in the out-of-domain setting, while reducing the attack success rate to just 0.12% on the BIPIA benchmark.

摘要

大型语言模型(LLMs)与外部数据源的整合日益普遍,其中检索增强生成(RAG)是典型应用。然而这种整合引入了间接提示注入(IPI)攻击的漏洞——外部数据中嵌入的隐藏指令可操纵LLMs执行非预期或有害行为。我们发现IPI攻击的成功本质上依赖于外部内容中能改变LLMs行为状态的嵌入指令。能否通过有效检测此类状态变化来防御IPI攻击?本文提出一种创新方法:以外源数据为输入,利用LLMs在前向传播和反向传播期间的行为状态来检测潜在IPI攻击。具体而言,我们证明中间层的隐藏状态和梯度能为指令检测提供高区分度特征。通过有效组合这些特征,本方法在域内设置下达到99.60%的检测准确率,在域外设置下达到96.90%,同时在BIPIA基准测试中将攻击成功率降至仅0.12%。


QiMeng-TensorOp: Automatically Generating High-Performance Tensor Operators with Hardware Primitives

Abstract

arXiv:2505.06302v1 Announce Type: cross Abstract: Computation-intensive tensor operators constitute over 90% of the computations in Large Language Models (LLMs) and Deep Neural Networks.Automatically and efficiently generating high-performance tensor operators with hardware primitives is crucial for diverse and ever-evolving hardware architectures like RISC-V, ARM, and GPUs, as manually optimized implementation takes at least months and lacks portability.LLMs excel at generating high-level language codes, but they struggle to fully comprehend hardware characteristics and produce high-performance tensor operators. We introduce a tensor-operator auto-generation framework with a one-line user prompt (QiMeng-TensorOp), which enables LLMs to automatically exploit hardware characteristics to generate tensor operators with hardware primitives, and tune parameters for optimal performance across diverse hardware. Experimental results on various hardware platforms, SOTA LLMs, and typical tensor operators demonstrate that QiMeng-TensorOp effectively unleashes the computing capability of various hardware platforms, and automatically generates tensor operators of superior performance. Compared with vanilla LLMs, QiMeng-TensorOp achieves up to 1291×1291 \times performance improvement. Even compared with human experts, QiMeng-TensorOp could reach 251%251 \% of OpenBLAS on RISC-V CPUs, and 124%124 \% of cuBLAS on NVIDIA GPUs. Additionally, QiMeng-TensorOp also significantly reduces development costs by 200×200 \times compared with human experts.

摘要

计算密集型张量算子占大型语言模型(LLM)和深度神经网络中90%以上的计算量。针对RISC-V、ARM和GPU等多样化且不断演进的硬件架构,自动高效生成具有硬件原语的高性能张量算子至关重要,因为人工优化实现至少需要数月时间且缺乏可移植性。虽然LLM擅长生成高级语言代码,但其难以充分理解硬件特性并生成高性能张量算子。我们提出了一种单行用户提示的张量算子自动生成框架(启梦-张量算子),使LLM能自动利用硬件特性生成包含硬件原语的张量算子,并通过参数调优实现跨硬件平台的性能优化。在不同硬件平台、前沿LLM和典型张量算子上的实验结果表明,启梦-张量算子能有效释放各类硬件平台的计算潜力,自动生成性能优越的张量算子。相较于原始LLM,启梦-张量算子实现了最高1291倍的性能提升。即使与人类专家相比,该框架在RISC-V CPU上可达OpenBLAS的251%,在NVIDIA GPU上可达cuBLAS的124%。此外,启梦-张量算子还将开发成本较人类专家降低200倍。


Collaborative Multi-LoRA Experts with Achievement-based Multi-Tasks Loss for Unified Multimodal Information Extraction

Abstract

arXiv:2505.06303v1 Announce Type: cross Abstract: Multimodal Information Extraction (MIE) has gained attention for extracting structured information from multimedia sources. Traditional methods tackle MIE tasks separately, missing opportunities to share knowledge across tasks. Recent approaches unify these tasks into a generation problem using instruction-based T5 models with visual adaptors, optimized through full-parameter fine-tuning. However, this method is computationally intensive, and multi-task fine-tuning often faces gradient conflicts, limiting performance. To address these challenges, we propose collaborative multi-LoRA experts with achievement-based multi-task loss (C-LoRAE) for MIE tasks. C-LoRAE extends the low-rank adaptation (LoRA) method by incorporating a universal expert to learn shared multimodal knowledge from cross-MIE tasks and task-specific experts to learn specialized instructional task features. This configuration enhances the model's generalization ability across multiple tasks while maintaining the independence of various instruction tasks and mitigating gradient conflicts. Additionally, we propose an achievement-based multi-task loss to balance training progress across tasks, addressing the imbalance caused by varying numbers of training samples in MIE tasks. Experimental results on seven benchmark datasets across three key MIE tasks demonstrate that C-LoRAE achieves superior overall performance compared to traditional fine-tuning methods and LoRA methods while utilizing a comparable number of training parameters to LoRA.

摘要

多模态信息抽取(MIE)因能从多媒体源中提取结构化信息而受到关注。传统方法单独处理MIE任务,错失了跨任务知识共享的机会。近期研究通过基于指令的T5模型与视觉适配器将这些任务统一为生成问题,并采用全参数微调进行优化。然而该方法计算成本高昂,且多任务微调常面临梯度冲突,限制了性能表现。为应对这些挑战,我们提出基于成就的多任务损失协同多LoRA专家模型(C-LoRAE)。C-LoRAE扩展了低秩自适应(LoRA)方法,通过引入通用专家学习跨MIE任务的共享多模态知识,以及任务特定专家学习专用指令任务特征。该架构在增强模型多任务泛化能力的同时,保持了各指令任务的独立性并缓解梯度冲突。此外,我们提出基于成就的多任务损失函数,通过平衡各任务训练进度,解决MIE任务中训练样本数量不均导致的失衡问题。在三个关键MIE任务、七个基准数据集上的实验表明,C-LoRAE在使用与LoRA相当训练参数量的情况下,整体性能优于传统微调方法和LoRA方法。


Learn to Think: Bootstrapping LLM Reasoning Capability Through Graph Learning

Abstract

arXiv:2505.06321v1 Announce Type: cross Abstract: Large Language Models (LLMs) have achieved remarkable success across various domains. However, they still face significant challenges, including high computational costs for training and limitations in solving complex reasoning problems. Although existing methods have extended the reasoning capabilities of LLMs through structured paradigms, these approaches often rely on task-specific prompts and predefined reasoning processes, which constrain their flexibility and generalizability. To address these limitations, we propose a novel framework that leverages graph learning to enable more flexible and adaptive reasoning capabilities for LLMs. Specifically, this approach models the reasoning process of a problem as a graph and employs LLM-based graph learning to guide the adaptive generation of each reasoning step. To further enhance the adaptability of the model, we introduce a Graph Neural Network (GNN) module to perform representation learning on the generated reasoning process, enabling real-time adjustments to both the model and the prompt. Experimental results demonstrate that this method significantly improves reasoning performance across multiple tasks without requiring additional training or task-specific prompt design. Code can be found in https://github.com/zch65458525/L2T.

摘要

大型语言模型(LLMs)已在多个领域取得显著成功,但仍面临训练计算成本高昂、解决复杂推理问题能力有限等重大挑战。尽管现有方法通过结构化范式扩展了LLMs的推理能力,但这些方法通常依赖于任务特定提示和预定义推理流程,限制了其灵活性与泛化性。为克服这些局限,我们提出一种创新框架,利用图学习技术赋予LLMs更灵活自适应的推理能力。该框架将问题推理过程建模为图结构,采用基于LLM的图学习技术自适应生成每个推理步骤。为进一步增强模型适应性,我们引入图神经网络(GNN)模块对生成的推理过程进行表征学习,实现模型与提示的实时动态调整。实验结果表明,该方法在无需额外训练或任务特定提示设计的情况下,显著提升了多任务推理性能。代码详见https://github.com/zch65458525/L2T。


AI Approaches to Qualitative and Quantitative News Analytics on NATO Unity

Abstract

arXiv:2505.06313v1 Announce Type: cross Abstract: The paper considers the use of GPT models with retrieval-augmented generation (RAG) for qualitative and quantitative analytics on NATO sentiments, NATO unity and NATO Article 5 trust opinion scores in different web sources: news sites found via Google Search API, Youtube videos with comments, and Reddit discussions. A RAG approach using GPT-4.1 model was applied to analyse news where NATO related topics were discussed. Two levels of RAG analytics were used: on the first level, the GPT model generates qualitative news summaries and quantitative opinion scores using zero-shot prompts; on the second level, the GPT model generates the summary of news summaries. Quantitative news opinion scores generated by the GPT model were analysed using Bayesian regression to get trend lines. The distributions found for the regression parameters make it possible to analyse an uncertainty in specified news opinion score trends. Obtained results show a downward trend for analysed scores of opinion related to NATO unity. This approach does not aim to conduct real political analysis; rather, it consider AI based approaches which can be used for further analytics as a part of a complex analytical approach. The obtained results demonstrate that the use of GPT models for news analysis can give informative qualitative and quantitative analytics, providing important insights. The dynamic model based on neural ordinary differential equations was considered for modelling public opinions. This approach makes it possible to analyse different scenarios for evolving public opinions.

摘要

本文探讨了利用检索增强生成(RAG)的GPT模型对不同网络资源(通过谷歌搜索API获取的新闻网站、含评论的YouTube视频及Reddit讨论)中涉及北约情绪、北约团结度及北约第五条信任度观点的定性与定量分析。研究采用基于GPT-4.1模型的RAG方法分析涉及北约议题的新闻报道,实施两级分析:第一级中GPT模型通过零样本提示生成定性新闻摘要与定量观点评分;第二级则对新闻摘要进行二次汇总。通过贝叶斯回归分析GPT模型生成的新闻观点评分数据以获取趋势线,回归参数的分布特性可用于评估特定新闻观点趋势的不确定性。结果显示与北约团结度相关的观点评分呈下降趋势。

该方法并非旨在进行实际政治分析,而是探索可作为复杂分析体系组成部分的AI技术方案。研究表明:GPT模型在新闻分析中能提供信息量丰富的定性与定量分析结果,具有重要参考价值。此外,研究采用基于神经常微分方程的动态模型模拟舆论演变,该技术可实现对不同舆论发展场景的分析。


Large Language Model-driven Security Assistant for Internet of Things via Chain-of-Thought

Abstract

arXiv:2505.06307v1 Announce Type: cross Abstract: The rapid development of Internet of Things (IoT) technology has transformed people's way of life and has a profound impact on both production and daily activities. However, with the rapid advancement of IoT technology, the security of IoT devices has become an unavoidable issue in both research and applications. Although some efforts have been made to detect or mitigate IoT security vulnerabilities, they often struggle to adapt to the complexity of IoT environments, especially when dealing with dynamic security scenarios. How to automatically, efficiently, and accurately understand these vulnerabilities remains a challenge. To address this, we propose an IoT security assistant driven by Large Language Model (LLM), which enhances the LLM's understanding of IoT security vulnerabilities and related threats. The aim of the ICoT method we propose is to enable the LLM to understand security issues by breaking down the various dimensions of security vulnerabilities and generating responses tailored to the user's specific needs and expertise level. By incorporating ICoT, LLM can gradually analyze and reason through complex security scenarios, resulting in more accurate, in-depth, and personalized security recommendations and solutions. Experimental results show that, compared to methods relying solely on LLM, our proposed LLM-driven IoT security assistant significantly improves the understanding of IoT security issues through the ICoT approach and provides personalized solutions based on the user's identity, demonstrating higher accuracy and reliability.

摘要

物联网(IoT)技术的快速发展改变了人们的生活方式,并对生产与日常活动产生了深远影响。然而,随着IoT技术的快速进步,物联网设备的安全性已成为研究和应用中不可回避的问题。尽管已有部分工作致力于检测或缓解IoT安全漏洞,但这些方法往往难以适应物联网环境的复杂性,尤其在处理动态安全场景时表现不足。如何自动化、高效且准确地理解这些漏洞仍是一项挑战。为此,我们提出了一种由大语言模型(LLM)驱动的IoT安全助手,通过增强LLM对物联网安全漏洞及相关威胁的理解能力来解决该问题。我们所提出的ICoT方法旨在通过分解安全漏洞的各个维度,并生成适应用户特定需求与专业水平的响应,使LLM能够理解安全问题。通过引入ICoT,LLM可以逐步分析和推理复杂的安全场景,从而提供更精准、深入且个性化的安全建议与解决方案。实验结果表明,与仅依赖LLM的方法相比,我们提出的LLM驱动型IoT安全助手通过ICoT方法显著提升了对物联网安全问题的理解能力,并能根据用户身份提供个性化解决方案,展现出更高的准确性与可靠性。


Document Attribution: Examining Citation Relationships using Large Language Models

Abstract

arXiv:2505.06324v1 Announce Type: cross Abstract: As Large Language Models (LLMs) are increasingly applied to document-based tasks - such as document summarization, question answering, and information extraction - where user requirements focus on retrieving information from provided documents rather than relying on the model's parametric knowledge, ensuring the trustworthiness and interpretability of these systems has become a critical concern. A central approach to addressing this challenge is attribution, which involves tracing the generated outputs back to their source documents. However, since LLMs can produce inaccurate or imprecise responses, it is crucial to assess the reliability of these citations. To tackle this, our work proposes two techniques. (1) A zero-shot approach that frames attribution as a straightforward textual entailment task. Our method using flan-ul2 demonstrates an improvement of 0.27% and 2.4% over the best baseline of ID and OOD sets of AttributionBench, respectively. (2) We also explore the role of the attention mechanism in enhancing the attribution process. Using a smaller LLM, flan-t5-small, the F1 scores outperform the baseline across almost all layers except layer 4 and layers 8 through 11.

摘要

随着大型语言模型(LLMs)日益广泛应用于基于文档的任务——如文档摘要、问答和信息抽取——在这些任务中用户需求侧重于从提供文档中检索信息,而非依赖模型的参数知识,确保这些系统的可信度和可解释性已成为关键问题。解决这一挑战的核心方法是归因,即追踪生成输出的来源文档。然而,由于LLMs可能产生不准确或不精确的响应,评估这些引用的可靠性至关重要。为此,我们提出两种技术:(1)零样本方法,将归因构建为简单的文本蕴含任务。我们使用flan-ul2的方法在AttributionBench的ID和OOD数据集上分别比最佳基线提高了0.27%和2.4%。(2)我们还探讨了注意力机制在增强归因过程中的作用。使用较小的LLM flan-t5-small时,除第4层及第8至11层外,F1分数在所有层均优于基线。


Prompting Large Language Models for Training-Free Non-Intrusive Load Monitoring

Abstract

arXiv:2505.06330v1 Announce Type: cross Abstract: Non-intrusive Load Monitoring (NILM) aims to disaggregate aggregate household electricity consumption into individual appliance usage, enabling more effective energy management. While deep learning has advanced NILM, it remains limited by its dependence on labeled data, restricted generalization, and lack of interpretability. In this paper, we introduce the first prompt-based NILM framework that leverages Large Language Models (LLMs) with in-context learning. We design and evaluate prompt strategies that integrate appliance features, timestamps and contextual information, as well as representative time-series examples, using the REDD dataset. With optimized prompts, LLMs achieve competitive state detection accuracy, reaching an average F1-score of 0.676 on unseen households, and demonstrate robust generalization without the need for fine-tuning. LLMs also enhance interpretability by providing clear, human-readable explanations for their predictions. Our results show that LLMs can reduce data requirements, improve adaptability, and provide transparent energy disaggregation in NILM applications.

摘要

非侵入式负荷监测(NILM)旨在将家庭总用电量分解为单个电器使用情况,以实现更有效的能源管理。尽管深度学习推动了NILM的发展,但其仍受限于对标注数据的依赖、泛化能力有限以及缺乏可解释性。本文首次提出基于提示的NILM框架,利用大型语言模型(LLMs)的上下文学习能力。我们设计并评估了整合电器特征、时间戳、上下文信息以及代表性时间序列示例的提示策略,使用REDD数据集进行验证。通过优化提示,LLMs在状态检测准确率上达到竞争性水平,在未见家庭数据上平均F1分数为0.676,且无需微调即展现出强大泛化能力。LLMs还能通过提供清晰、人类可读的预测解释来增强可解释性。实验结果表明,LLMs能够降低数据需求、提升适应性,并为NILM应用提供透明的能源分解方案。


Quantum State Preparation via Large-Language-Model-Driven Evolution

Abstract

arXiv:2505.06347v1 Announce Type: cross Abstract: We propose an automated framework for quantum circuit design by integrating large-language models (LLMs) with evolutionary optimization to overcome the rigidity, scalability limitations, and expert dependence of traditional ones in variational quantum algorithms. Our approach (FunSearch) autonomously discovers hardware-efficient ans"atze with new features of scalability and system-size-independent number of variational parameters entirely from scratch. Demonstrations on the Ising and XY spin chains with n = 9 qubits yield circuits containing 4 parameters, achieving near-exact energy extrapolation across system sizes. Implementations on quantum hardware (Zuchongzhi chip) validate practicality, where two-qubit quantum gate noises can be effectively mitigated via zero-noise extrapolations for a spin chain system as large as 20 sites. This framework bridges algorithmic design and experimental constraints, complementing contemporary quantum architecture search frameworks to advance scalable quantum simulations.

摘要

我们提出了一种自动化量子电路设计框架,通过将大语言模型(LLMs)与进化优化相结合,以克服变分量子算法中传统方法存在的刚性、可扩展性限制及专家依赖问题。该框架(FunSearch)能自主发现具有全新可扩展性特征且变分参数数量与系统规模无关的硬件高效ans"atze。在n=9量子位的Ising和XY自旋链上的演示实验表明,仅含4个参数的电路即可实现跨系统规模的近精确能量外推。在量子硬件(祖冲之芯片)上的实施验证了其实际可行性,其中针对多达20个位点的自旋链系统,通过零噪声外推可有效缓解双量子位门噪声。该框架 bridging 了算法设计与实验约束,与当代量子架构搜索框架形成互补,共同推动可扩展量子模拟的发展。


Towards AI-Driven Human-Machine Co-Teaming for Adaptive and Agile Cyber Security Operation Centers

Abstract

arXiv:2505.06394v1 Announce Type: cross Abstract: Security Operations Centers (SOCs) face growing challenges in managing cybersecurity threats due to an overwhelming volume of alerts, a shortage of skilled analysts, and poorly integrated tools. Human-AI collaboration offers a promising path to augment the capabilities of SOC analysts while reducing their cognitive overload. To this end, we introduce an AI-driven human-machine co-teaming paradigm that leverages large language models (LLMs) to enhance threat intelligence, alert triage, and incident response workflows. We present a vision in which LLM-based AI agents learn from human analysts the tacit knowledge embedded in SOC operations, enabling the AI agents to improve their performance on SOC tasks through this co-teaming. We invite SOCs to collaborate with us to further develop this process and uncover replicable patterns where human-AI co-teaming yields measurable improvements in SOC productivity.

摘要

安全运营中心(SOC)在应对网络安全威胁时面临日益严峻的挑战,包括警报数量激增、技术分析师短缺以及工具集成度不足等问题。人机协同为增强SOC分析师能力并减轻其认知负荷提供了可行路径。为此,我们提出一种基于人工智能的人机协作范式,利用大语言模型(LLM)强化威胁情报分析、警报分级和事件响应流程。我们构想LLM驱动的智能体能够从分析师处习得SOC运营中的隐性知识,通过这种协作模式提升其在SOC任务中的表现。诚邀各SOC机构与我们共同推进该研究,以发掘可复现的人机协作模式,从而显著提升安全运营中心的工作效能。


The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization

Abstract

arXiv:2505.06371v1 Announce Type: cross Abstract: As the adoption of Generative AI in real-world services grow explosively, energy has emerged as a critical bottleneck resource. However, energy remains a metric that is often overlooked, under-explored, or poorly understood in the context of building ML systems. We present the ML.ENERGY Benchmark, a benchmark suite and tool for measuring inference energy consumption under realistic service environments, and the corresponding ML.ENERGY Leaderboard, which have served as a valuable resource for those hoping to understand and optimize the energy consumption of their generative AI services. In this paper, we explain four key design principles for benchmarking ML energy we have acquired over time, and then describe how they are implemented in the ML.ENERGY Benchmark. We then highlight results from the latest iteration of the benchmark, including energy measurements of 40 widely used model architectures across 6 different tasks, case studies of how ML design choices impact energy consumption, and how automated optimization recommendations can lead to significant (sometimes more than 40%) energy savings without changing what is being computed by the model. The ML.ENERGY Benchmark is open-source and can be easily extended to various customized models and application scenarios.

摘要

随着生成式人工智能在现实服务中的爆炸式应用,能源已成为关键瓶颈资源。然而在机器学习系统构建过程中,能源仍是一个常被忽视、缺乏深入探索或理解不足的指标。我们提出ML.ENERGY基准测试套件——一个用于测量实际服务环境下推理能耗的基准工具集,以及配套的ML.ENERGY排行榜,这些资源为理解并优化生成式AI服务能耗提供了重要参考。本文阐述了我们在长期实践中总结的机器学习能耗基准测试四项关键设计原则,并说明其如何在ML.ENERGY基准中实现。我们重点展示了最新基准测试结果,包括跨6类任务的40种常用模型架构能耗测量、机器学习设计选择对能耗影响的案例研究,以及自动化优化建议如何在不改变模型计算内容的情况下实现显著(有时超过40%)的节能效果。ML.ENERGY基准测试为开源项目,可轻松扩展至各类定制模型和应用场景。


Bi-LSTM based Multi-Agent DRL with Computation-aware Pruning for Agent Twins Migration in Vehicular Embodied AI Networks

Abstract

arXiv:2505.06378v1 Announce Type: cross Abstract: With the advancement of large language models and embodied Artificial Intelligence (AI) in the intelligent transportation scenarios, the combination of them in intelligent transportation spawns the Vehicular Embodied AI Network (VEANs). In VEANs, Autonomous Vehicles (AVs) are typical agents whose local advanced AI applications are defined as vehicular embodied AI agents, enabling capabilities such as environment perception and multi-agent collaboration. Due to computation latency and resource constraints, the local AI applications and services running on vehicular embodied AI agents need to be migrated, and subsequently referred to as vehicular embodied AI agent twins, which drive the advancement of vehicular embodied AI networks to offload intensive tasks to Roadside Units (RSUs), mitigating latency problems while maintaining service quality. Recognizing workload imbalance among RSUs in traditional approaches, we model AV-RSU interactions as a Stackelberg game to optimize bandwidth resource allocation for efficient migration. A Tiny Multi-Agent Bidirectional LSTM Proximal Policy Optimization (TMABLPPO) algorithm is designed to approximate the Stackelberg equilibrium through decentralized coordination. Furthermore, a personalized neural network pruning algorithm based on Path eXclusion (PX) dynamically adapts to heterogeneous AV computation capabilities by identifying task-critical parameters in trained models, reducing model complexity with less performance degradation. Experimental validation confirms the algorithm's effectiveness in balancing system load and minimizing delays, demonstrating significant improvements in vehicular embodied AI agent deployment.

摘要

随着大语言模型与具身人工智能在智能交通场景中的发展,二者结合催生了车载具身AI网络(VEANs)。在该网络中,自动驾驶车辆作为典型智能体,其本地高级AI应用被定义为车载具身AI代理,具备环境感知、多智能体协作等能力。受计算延迟与资源限制,车载具身AI代理运行的本地AI应用与服务需进行迁移(后称车载具身AI代理数字孪生),推动网络通过将密集任务卸载至路侧单元(RSU)来维持服务质量并缓解延迟问题。针对传统方法中RSU负载不均现象,我们将AV-RSU交互建模为Stackelberg博弈以优化带宽资源分配效率。设计了一种基于双向LSTM近端策略优化的微型多智能体算法(TMABLPPO),通过分布式协调逼近Stackelberg均衡。进一步提出基于路径排除(PX)的个性化神经网络剪枝算法,通过识别训练模型中任务关键参数动态适应异构AV算力,在降低模型复杂度的同时减少性能损失。实验验证表明该算法能有效平衡系统负载并降低延迟,显著提升了车载具身AI代理的部署效能。


Camera Control at the Edge with Language Models for Scene Understanding

Abstract

arXiv:2505.06402v1 Announce Type: cross Abstract: In this paper, we present Optimized Prompt-based Unified System (OPUS), a framework that utilizes a Large Language Model (LLM) to control Pan-Tilt-Zoom (PTZ) cameras, providing contextual understanding of natural environments. To achieve this goal, the OPUS system improves cost-effectiveness by generating keywords from a high-level camera control API and transferring knowledge from larger closed-source language models to smaller ones through Supervised Fine-Tuning (SFT) on synthetic data. This enables efficient edge deployment while maintaining performance comparable to larger models like GPT-4. OPUS enhances environmental awareness by converting data from multiple cameras into textual descriptions for language models, eliminating the need for specialized sensory tokens. In benchmark testing, our approach significantly outperformed both traditional language model techniques and more complex prompting methods, achieving a 35% improvement over advanced techniques and a 20% higher task accuracy compared to closed-source models like Gemini Pro. The system demonstrates OPUS's capability to simplify PTZ camera operations through an intuitive natural language interface. This approach eliminates the need for explicit programming and provides a conversational method for interacting with camera systems, representing a significant advancement in how users can control and utilize PTZ camera technology.

摘要

本文提出基于优化提示的统一系统框架OPUS,该框架利用大型语言模型(LLM)控制云台变焦(PTZ)摄像机,实现对自然环境的语境理解。为实现这一目标,OPUS系统通过高层摄像机控制API生成关键词,并在合成数据上进行监督微调(SFT),将大型闭源语言模型的知识迁移至较小模型,从而在保持与GPT-4等大型模型相当性能的同时,显著提升边缘部署的成本效益。该系统通过将多摄像机数据转换为语言模型可理解的文本描述,增强环境感知能力,无需专用传感令牌。基准测试表明,本方法在传统语言模型技术和复杂提示方法上均表现优异:相较先进技术实现35%的性能提升,与Gemini Pro等闭源模型相比任务准确率提高20%。OPUS系统通过直观的自然语言界面简化PTZ摄像机操作,无需显式编程即可实现对话式摄像机系统交互,标志着PTZ摄像技术用户控制方式的重大进步。


Natural Reflection Backdoor Attack on Vision Language Model for Autonomous Driving

Abstract

arXiv:2505.06413v1 Announce Type: cross Abstract: Vision-Language Models (VLMs) have been integrated into autonomous driving systems to enhance reasoning capabilities through tasks such as Visual Question Answering (VQA). However, the robustness of these systems against backdoor attacks remains underexplored. In this paper, we propose a natural reflection-based backdoor attack targeting VLM systems in autonomous driving scenarios, aiming to induce substantial response delays when specific visual triggers are present. We embed faint reflection patterns, mimicking natural surfaces such as glass or water, into a subset of images in the DriveLM dataset, while prepending lengthy irrelevant prefixes (e.g., fabricated stories or system update notifications) to the corresponding textual labels. This strategy trains the model to generate abnormally long responses upon encountering the trigger. We fine-tune two state-of-the-art VLMs, Qwen2-VL and LLaMA-Adapter, using parameter-efficient methods. Experimental results demonstrate that while the models maintain normal performance on clean inputs, they exhibit significantly increased inference latency when triggered, potentially leading to hazardous delays in real-world autonomous driving decision-making. Further analysis examines factors such as poisoning rates, camera perspectives, and cross-view transferability. Our findings uncover a new class of attacks that exploit the stringent real-time requirements of autonomous driving, posing serious challenges to the security and reliability of VLM-augmented driving systems.

摘要

视觉语言模型(VLMs)已被整合到自动驾驶系统中,通过视觉问答(VQA)等任务增强推理能力。然而,这些系统对后门攻击的鲁棒性仍未得到充分研究。本文提出一种基于自然反射的后门攻击方法,针对自动驾驶场景中的VLM系统,旨在当特定视觉触发器出现时诱发显著响应延迟。我们在DriveLM数据集的子集中嵌入模拟玻璃或水面等自然表面的微弱反射图案,同时在对应文本标签前添加冗长无关前缀(如虚构故事或系统更新通知)。该策略训练模型在遇到触发器时生成异常冗长的响应。我们采用参数高效方法对两种前沿VLM(Qwen2-VL和LLaMA-Adapter)进行微调。实验结果表明:模型在干净输入上保持正常性能,但被触发时推理延迟显著增加,可能导致实际自动驾驶决策中出现危险延误。进一步分析探讨了投毒率、摄像机视角及跨视图可迁移性等因素。本研究揭示了一类新型攻击,其利用自动驾驶严格的实时性要求,对VLM增强驾驶系统的安全性与可靠性构成严峻挑战。


xGen-small Technical Report

Abstract

arXiv:2505.06496v1 Announce Type: cross Abstract: We introduce xGen-small, a family of 4B and 9B Transformer decoder models optimized for long-context applications. Our vertically integrated pipeline unites domain-balanced, frequency-aware data curation; multi-stage pre-training with quality annealing and length extension to 128k tokens; and targeted post-training via supervised fine-tuning, preference learning, and online reinforcement learning. xGen-small delivers strong performance across various tasks, especially in math and coding domains, while excelling at long context benchmarks.

摘要

我们推出xGen-small系列模型,这是一组专为长上下文应用优化的40亿和90亿参数Transformer解码器。通过垂直整合的技术流程,我们实现了:基于领域平衡和频率感知的数据筛选;采用质量退火和128k标记长度扩展的多阶段预训练;以及通过监督微调、偏好学习和在线强化学习进行针对性后训练。xGen-small在各类任务中表现优异,尤其在数学与编程领域展现出强大性能,同时在长上下文基准测试中表现突出。


QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration

Abstract

arXiv:2505.06481v1 Announce Type: cross Abstract: The deployment of mixture-of-experts (MoE) large language models (LLMs) presents significant challenges due to their high memory demands. These challenges become even more pronounced in multi-tenant environments, where shared resources must accommodate multiple models, limiting the effectiveness of conventional virtualization techniques. This paper addresses the problem of efficiently serving multiple fine-tuned MoE-LLMs on a single-GPU. We propose a serving system that employs \textit{similarity-based expert consolidation} to reduce the overall memory footprint by sharing similar experts across models. To ensure output quality, we introduce \textit{runtime partial reconfiguration}, dynamically replacing non-expert layers when processing requests from different models. As a result, our approach achieves a competitive output quality while maintaining throughput comparable to serving a single model while incurring a negligible increase in time-to-first-token (TTFT). Experiments on a server with a single NVIDIA A100 GPU (80GB) using Mixtral-8x7B models demonstrate an 85% average reduction in turnaround time compared to NVIDIA's multi-instance GPU (MIG). Furthermore, experiments on Google's Switch Transformer Base-8 model with up to four variants demonstrate the scalability and resilience of our approach in maintaining output quality compared to other model merging baselines, highlighting its effectiveness.

摘要

混合专家(MoE)大型语言模型(LLM)的部署因其高内存需求而面临重大挑战。在多租户环境中,这些挑战尤为突出,因为共享资源需同时承载多个模型,导致传统虚拟化技术的效能受限。本文研究了在单GPU上高效服务多个微调MoE-LLM的问题,提出一种采用基于相似性的专家整合的服务系统,通过跨模型共享相似专家来降低总体内存占用。为确保输出质量,我们引入运行时部分重配置机制,在处理不同模型的请求时动态替换非专家层。实验结果表明,该方法在保持与单模型服务相当的吞吐量同时,仅带来可忽略的首词生成时间(TTFT)增长,即可获得具有竞争力的输出质量。在配备NVIDIA A100 GPU(80GB)的服务器上使用Mixtral-8x7B模型进行的实验显示,相较于NVIDIA多实例GPU(MIG)方案,平均任务周转时间降低85%。此外,在谷歌Switch Transformer Base-8模型上进行的四变体实验表明,与其他模型合并基线相比,我们的方法在保持输出质量方面具有卓越的可扩展性和鲁棒性,充分验证了其有效性。


System Prompt Poisoning: Persistent Attacks on Large Language Models Beyond User Injection

Abstract

arXiv:2505.06493v1 Announce Type: cross Abstract: Large language models (LLMs) have gained widespread adoption across diverse applications due to their impressive generative capabilities. Their plug-and-play nature enables both developers and end users to interact with these models through simple prompts. However, as LLMs become more integrated into various systems in diverse domains, concerns around their security are growing. Existing studies mainly focus on threats arising from user prompts (e.g. prompt injection attack) and model output (e.g. model inversion attack), while the security of system prompts remains largely overlooked. This work bridges the critical gap. We introduce system prompt poisoning, a new attack vector against LLMs that, unlike traditional user prompt injection, poisons system prompts hence persistently impacts all subsequent user interactions and model responses. We systematically investigate four practical attack strategies in various poisoning scenarios. Through demonstration on both generative and reasoning LLMs, we show that system prompt poisoning is highly feasible without requiring jailbreak techniques, and effective across a wide range of tasks, including those in mathematics, coding, logical reasoning, and natural language processing. Importantly, our findings reveal that the attack remains effective even when user prompts employ advanced prompting techniques like chain-of-thought (CoT). We also show that such techniques, including CoT and retrieval-augmentation-generation (RAG), which are proven to be effective for improving LLM performance in a wide range of tasks, are significantly weakened in their effectiveness by system prompt poisoning.

摘要

大型语言模型(LLMs)因其卓越的生成能力已在多样化应用中得到广泛采用。其即插即用的特性使开发者和终端用户均可通过简单提示与模型交互。然而,随着LLMs在各领域系统中日益深入整合,其安全性问题逐渐凸显。现有研究主要关注用户提示(如提示注入攻击)和模型输出(如模型反演攻击)引发的威胁,而系统提示的安全性仍被严重忽视。本研究填补了这一关键空白。我们提出系统提示投毒这一新型LLM攻击向量——与传统用户提示注入不同,该攻击通过污染系统提示从而持续影响所有后续用户交互和模型响应。我们系统性地研究了多种投毒场景下的四种实用攻击策略。通过在生成型和推理型LLMs上的实证演示,表明系统提示投毒无需越狱技术即可高度可行,且能有效作用于数学、编程、逻辑推理和自然语言处理等广泛任务。值得注意的是,研究发现即使用户提示采用思维链(CoT)等高级提示技术,攻击依然有效。我们还证明,包括CoT和检索增强生成(RAG)在内的、已被证实能显著提升LLM多任务性能的技术,其有效性会因系统提示投毒而大幅削弱。


MacRAG: Compress, Slice, and Scale-up for Multi-Scale Adaptive Context RAG

Abstract

arXiv:2505.06569v1 Announce Type: cross Abstract: Long-context (LC) Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG) hold strong potential for complex multi-hop and large-document tasks. However, existing RAG systems often suffer from imprecise retrieval, incomplete context coverage under constrained context windows, and fragmented information caused by suboptimal context construction. We introduce Multi-scale Adaptive Context RAG (MacRAG), a hierarchical retrieval framework that compresses and partitions documents into coarse-to-fine granularities, then adaptively merges relevant contexts through chunk- and document-level expansions in real time. By starting from the finest-level retrieval and progressively incorporating higher-level and broader context, MacRAG constructs effective query-specific long contexts, optimizing both precision and coverage. Evaluations on the challenging LongBench expansions of HotpotQA, 2WikiMultihopQA, and Musique confirm that MacRAG consistently surpasses baseline RAG pipelines on single- and multi-step generation with Llama-3.1-8B, Gemini-1.5-pro, and GPT-4o. Our results establish MacRAG as an efficient, scalable solution for real-world long-context, multi-hop reasoning. Our code is available at https://github.com/Leezekun/MacRAG.

摘要

长上下文(LC)大语言模型(LLMs)与检索增强生成(RAG)相结合,在复杂多跳和大文档任务中展现出强大潜力。然而,现有RAG系统常面临检索不精确、受限上下文窗口下的上下文覆盖不完整,以及次优上下文构建导致的信息碎片化问题。我们提出多尺度自适应上下文RAG(MacRAG),这是一种分层检索框架,通过将文档压缩并分割为从粗到细的粒度,随后实时通过块级和文档级扩展自适应合并相关上下文。MacRAG从最细粒度检索开始,逐步融入更高层次和更广范围的上下文,从而构建针对特定查询的有效长上下文,优化精度与覆盖率的平衡。在HotpotQA、2WikiMultihopQA和Musique的LongBench扩展基准上的评估表明,MacRAG在Llama-3.1-8B、Gemini-1.5-pro和GPT-4o模型上,无论是单步还是多步生成任务,均持续超越基线RAG流程。实验结果证明MacRAG是一种高效、可扩展的解决方案,适用于现实世界中的长上下文多跳推理任务。代码已开源:https://github.com/Leezekun/MacRAG。


Enfoque Odychess: Un m'etodo dial'ectico, constructivista y adaptativo para la ense~nanza del ajedrez con inteligencias artificiales generativas

Abstract

arXiv:2505.06652v1 Announce Type: cross Abstract: Chess teaching has evolved through different approaches, however, traditional methodologies, often based on memorization, contrast with the new possibilities offered by generative artificial intelligence, a technology still little explored in this field. This study seeks to empirically validate the effectiveness of the Odychess Approach in improving chess knowledge, strategic understanding, and metacognitive skills in students. A quasi-experimental study was conducted with a pre-test/post-test design and a control group (N=60). The experimental intervention implemented the Odychess Approach, incorporating a Llama 3.3 language model that was specifically adapted using Parameter-Efficient Fine-Tuning (PEFT) techniques to act as a Socratic chess tutor. Quantitative assessment instruments were used to measure chess knowledge, strategic understanding, and metacognitive skills before and after the intervention. The results of the quasi-experimental study showed significant improvements in the experimental group compared to the control group in the three variables analyzed: chess knowledge, strategic understanding, and metacognitive skills. The complementary qualitative analysis revealed greater analytical depth, more developed dialectical reasoning, and increased intrinsic motivation in students who participated in the Odychess method-based intervention. The Odychess Approach represents an effective pedagogical methodology for teaching chess, demonstrating the potential of the synergistic integration of constructivist and dialectical principles with generative artificial intelligence. The implications of this work are relevant for educators and institutions interested in adopting innovative pedagogical technologies and for researchers in the field of AI applied to education, highlighting the transferability of the language model adaptation methodology to other educational domains.

摘要

国际象棋教学历经多种方法演变,然而基于记忆的传统教学法与生成式人工智能提供的新可能性形成鲜明对比——该技术在此领域仍鲜少被探索。本研究旨在实证验证Odychess方法在提升学生象棋知识、战略理解及元认知技能方面的有效性。采用前测/后测设计的准实验研究(N=60)设置对照组,实验组实施整合Llama 3.3语言模型的Odychess方法,该模型通过参数高效微调技术(PEFT)专门适配为苏格拉底式象棋导师。使用定量评估工具测量干预前后象棋知识、战略理解和元认知技能的变化。准实验结果显示:实验组在三个分析变量(象棋知识、战略理解、元认知技能)上均较对照组有显著提升。补充质性分析表明,参与Odychess方法干预的学生表现出更强的分析深度、更完善的辩证思维及更高的内在动机。Odychess方法代表了一种有效的象棋教学法,证明了建构主义与辩证原则同生成式人工智能协同整合的潜力。本研究对关注创新教育技术的教育者与机构、以及人工智能教育应用领域的研究者具有重要启示,尤其体现在语言模型适配方法向其他教育领域的可迁移性价值。


ThreatLens: LLM-guided Threat Modeling and Test Plan Generation for Hardware Security Verification

Abstract

arXiv:2505.06821v1 Announce Type: cross Abstract: Current hardware security verification processes predominantly rely on manual threat modeling and test plan generation, which are labor-intensive, error-prone, and struggle to scale with increasing design complexity and evolving attack methodologies. To address these challenges, we propose ThreatLens, an LLM-driven multi-agent framework that automates security threat modeling and test plan generation for hardware security verification. ThreatLens integrates retrieval-augmented generation (RAG) to extract relevant security knowledge, LLM-powered reasoning for threat assessment, and interactive user feedback to ensure the generation of practical test plans. By automating these processes, the framework reduces the manual verification effort, enhances coverage, and ensures a structured, adaptable approach to security verification. We evaluated our framework on the NEORV32 SoC, demonstrating its capability to automate security verification through structured test plans and validating its effectiveness in real-world scenarios.

摘要

当前硬件安全验证流程主要依赖人工威胁建模和测试计划生成,这些方法不仅劳动密集、容易出错,而且难以应对日益增长的设计复杂度和不断演变的攻击方法。为解决这些挑战,我们提出ThreatLens——一个基于大语言模型(LLM)的多智能体框架,可自动化完成硬件安全验证中的威胁建模和测试计划生成。该框架整合了检索增强生成(RAG)技术以提取相关安全知识,利用LLM驱动的推理进行威胁评估,并通过交互式用户反馈确保生成具有实用性的测试方案。通过自动化这些流程,该框架显著降低了人工验证工作量,提升了覆盖范围,并确保采用结构化、可适配的安全验证方法。我们在NEORV32 SoC上对该框架进行了评估,结果表明其能通过结构化测试方案实现安全验证自动化,并在实际场景中验证了有效性。


The power of fine-grained experts: Granularity boosts expressivity in Mixture of Experts

Abstract

arXiv:2505.06839v1 Announce Type: cross Abstract: Mixture-of-Experts (MoE) layers are increasingly central to frontier model architectures. By selectively activating parameters, they reduce computational cost while scaling total parameter count. This paper investigates the impact of the number of active experts, termed granularity, comparing architectures with many (e.g., 8 per layer in DeepSeek) to those with fewer (e.g., 1 per layer in Llama-4 models). We prove an exponential separation in network expressivity based on this design parameter, suggesting that models benefit from higher granularity. Experimental results corroborate our theoretical findings and illustrate this separation.

摘要

混合专家(MoE)层在前沿模型架构中日益重要。通过选择性激活参数,它们在扩大参数总量的同时降低了计算成本。本文研究了活跃专家数量(称为粒度)的影响,比较了每层使用较多专家(如DeepSeek模型每层8个)与较少专家(如Llama-4模型每层1个)的架构。我们证明该设计参数会导致网络表达能力的指数级差异,表明模型受益于更高的粒度。实验结果验证了理论发现并展示了这种差异。


RedTeamLLM: an Agentic AI framework for offensive security

Abstract

arXiv:2505.06913v1 Announce Type: cross Abstract: From automated intrusion testing to discovery of zero-day attacks before software launch, agentic AI calls for great promises in security engineering. This strong capability is bound with a similar threat: the security and research community must build up its models before the approach is leveraged by malicious actors for cybercrime. We therefore propose and evaluate RedTeamLLM, an integrated architecture with a comprehensive security model for automatization of pentest tasks. RedTeamLLM follows three key steps: summarizing, reasoning and act, which embed its operational capacity. This novel framework addresses four open challenges: plan correction, memory management, context window constraint, and generality vs. specialization. Evaluation is performed through the automated resolution of a range of entry-level, but not trivial, CTF challenges. The contribution of the reasoning capability of our agentic AI framework is specifically evaluated.

摘要

从自动化渗透测试到软件发布前的零日漏洞发现,智能体人工智能为安全工程领域带来了巨大前景。这种强大能力伴随着同等程度的威胁:安全研究界必须在该技术被恶意行为者用于网络犯罪前建立相应防御模型。为此,我们提出并评估了RedTeamLLM——一种集成架构,其包含实现渗透测试任务自动化的综合安全模型。RedTeamLLM遵循三个关键步骤:总结、推理与执行,这些步骤构成了其操作能力。该新型框架解决了四大开放挑战:计划修正、记忆管理、上下文窗口限制以及通用性与专业性的平衡。通过自动化解决一系列入门级但非简单的CTF挑战进行性能评估,特别验证了我们智能体AI框架中推理能力的贡献。


Convert Language Model into a Value-based Strategic Planner

Abstract

arXiv:2505.06987v1 Announce Type: cross Abstract: Emotional support conversation (ESC) aims to alleviate the emotional distress of individuals through effective conversations. Although large language models (LLMs) have obtained remarkable progress on ESC, most of these studies might not define the diagram from the state model perspective, therefore providing a suboptimal solution for long-term satisfaction. To address such an issue, we leverage the Q-learning on LLMs, and propose a framework called straQ*. Our framework allows a plug-and-play LLM to bootstrap the planning during ESC, determine the optimal strategy based on long-term returns, and finally guide the LLM to response. Substantial experiments on ESC datasets suggest that straQ* outperforms many baselines, including direct inference, self-refine, chain of thought, finetuning, and finite state machines.

摘要

情感支持对话(ESC)旨在通过有效交流缓解个体的情绪困扰。尽管大型语言模型(LLM)在ESC领域取得了显著进展,但现有研究大多未能从状态模型视角明确定义对话流程,导致长期满意度优化不足。针对该问题,我们提出在LLM上应用Q学习算法,构建名为straQ的框架。该框架支持即插即用的LLM在ESC过程中自主规划策略,基于长期回报确定最优方案,并最终引导LLM生成响应。在多个ESC数据集上的大量实验表明,straQ的性能优于直接推理、自我优化、思维链、微调及有限状态机等基线方法。


IM-BERT: Enhancing Robustness of BERT through the Implicit Euler Method

Abstract

arXiv:2505.06889v1 Announce Type: cross Abstract: Pre-trained Language Models (PLMs) have achieved remarkable performance on diverse NLP tasks through pre-training and fine-tuning. However, fine-tuning the model with a large number of parameters on limited downstream datasets often leads to vulnerability to adversarial attacks, causing overfitting of the model on standard datasets. To address these issues, we propose IM-BERT from the perspective of a dynamic system by conceptualizing a layer of BERT as a solution of Ordinary Differential Equations (ODEs). Under the situation of initial value perturbation, we analyze the numerical stability of two main numerical ODE solvers: the explicit and implicit Euler approaches. Based on these analyses, we introduce a numerically robust IM-connection incorporating BERT's layers. This strategy enhances the robustness of PLMs against adversarial attacks, even in low-resource scenarios, without introducing additional parameters or adversarial training strategies. Experimental results on the adversarial GLUE (AdvGLUE) dataset validate the robustness of IM-BERT under various conditions. Compared to the original BERT, IM-BERT exhibits a performance improvement of approximately 8.3%p on the AdvGLUE dataset. Furthermore, in low-resource scenarios, IM-BERT outperforms BERT by achieving 5.9%p higher accuracy.

摘要

预训练语言模型(PLMs)通过预训练和微调在多样化自然语言处理任务中取得了显著性能。然而,在有限下游数据集上对参数量庞大的模型进行微调时,往往会导致模型易受对抗攻击影响,从而在标准数据集上出现过拟合现象。针对这些问题,我们从动态系统视角提出IM-BERT模型,将BERT的某一层概念化为常微分方程(ODEs)的数值解。在初始值扰动情况下,我们分析了两种主要ODE数值解法(显式和隐式欧拉方法)的数值稳定性。基于这些分析,我们提出了一种结合BERT层结构的数值鲁棒性IM连接策略。该方法无需引入额外参数或对抗训练策略,即可增强预训练语言模型在低资源场景下对抗攻击的鲁棒性。在对抗性GLUE(AdvGLUE)数据集上的实验结果表明,IM-BERT在不同条件下均表现出优越的鲁棒性。相较于原始BERT模型,IM-BERT在AdvGLUE数据集上实现了约8.3%的性能提升。此外,在低资源场景中,IM-BERT以5.9%的准确率优势超越BERT模型。


The Wisdom of Agent Crowds: A Human-AI Interaction Innovation Ignition Framework

Abstract

arXiv:2505.06947v1 Announce Type: cross Abstract: With the widespread application of large AI models in various fields, the automation level of multi-agent systems has been continuously improved. However, in high-risk decision-making scenarios such as healthcare and finance, human participation and the alignment of intelligent systems with human intentions remain crucial. This paper focuses on the financial scenario and constructs a multi-agent brainstorming framework based on the BDI theory. A human-computer collaborative multi-agent financial analysis process is built using Streamlit. The system plans tasks according to user intentions, reduces users' cognitive load through real-time updated structured text summaries and the interactive Cothinker module, and reasonably integrates general and reasoning large models to enhance the ability to handle complex problems. By designing a quantitative analysis algorithm for the sentiment tendency of interview content based on LLMs and a method for evaluating the diversity of ideas generated by LLMs in brainstorming based on k-means clustering and information entropy, the system is comprehensively evaluated. The results of human factors testing show that the system performs well in terms of usability and user experience. Although there is still room for improvement, it can effectively support users in completing complex financial tasks. The research shows that the system significantly improves the efficiency of human-computer interaction and the quality of decision-making in financial decision-making scenarios, providing a new direction for the development of related fields.

摘要

随着大型AI模型在各领域的广泛应用,多智能体系统的自动化水平持续提升。然而在医疗、金融等高风险决策场景中,人类参与及智能系统与人类意图的对齐仍至关重要。本文聚焦金融场景,基于BDI理论构建多智能体头脑风暴框架,利用Streamlit搭建人机协同的多智能体金融分析流程。该系统根据用户意图规划任务,通过实时更新的结构化文本摘要和交互式Cothinker模块降低用户认知负荷,并合理整合通用大模型与推理大模型以提升复杂问题处理能力。通过设计基于LLM的访谈内容情感倾向量化分析算法,以及基于k-means聚类与信息熵的LLM头脑风暴观点多样性评估方法,对系统进行综合评估。人因测试结果表明,系统在可用性和用户体验方面表现良好,虽仍有改进空间,但能有效支持用户完成复杂金融任务。研究表明,该系统显著提升了金融决策场景中人机交互效率和决策质量,为相关领域发展提供了新方向。


Can LLM-based Financial Investing Strategies Outperform the Market in Long Run?

Abstract

arXiv:2505.07078v1 Announce Type: cross Abstract: Large Language Models (LLMs) have recently been leveraged for asset pricing tasks and stock trading applications, enabling AI agents to generate investment decisions from unstructured financial data. However, most evaluations of LLM timing-based investing strategies are conducted on narrow timeframes and limited stock universes, overstating effectiveness due to survivorship and data-snooping biases. We critically assess their generalizability and robustness by proposing FINSABER, a backtesting framework evaluating timing-based strategies across longer periods and a larger universe of symbols. Systematic backtests over two decades and 100+ symbols reveal that previously reported LLM advantages deteriorate significantly under broader cross-section and over a longer-term evaluation. Our market regime analysis further demonstrates that LLM strategies are overly conservative in bull markets, underperforming passive benchmarks, and overly aggressive in bear markets, incurring heavy losses. These findings highlight the need to develop LLM strategies that are able to prioritise trend detection and regime-aware risk controls over mere scaling of framework complexity.

摘要

大型语言模型(LLMs)近期被应用于资产定价任务和股票交易领域,使人工智能代理能够从非结构化金融数据生成投资决策。然而,现有基于LLM择时投资策略的评估大多在狭窄时间范围和有限股票池中进行,由于生存偏差和数据窥探偏差导致有效性被高估。我们通过提出FINSABER回测框架,在更长时间跨度和更大标的范围内评估择时策略,对其泛化性和鲁棒性进行批判性检验。跨越二十年、涵盖100余种标的的系统性回测表明,先前报道的LLM优势在更广泛横截面和长期评估中显著减弱。我们的市场状态分析进一步揭示:LLM策略在牛市过度保守导致跑输被动基准,在熊市过度激进造成重大亏损。这些发现强调,需要开发能够优先考虑趋势检测和状态感知风险控制,而非单纯扩大框架复杂度的LLM策略。


ParaView-MCP: An Autonomous Visualization Agent with Direct Tool Use

Abstract

arXiv:2505.07064v1 Announce Type: cross Abstract: While powerful and well-established, tools like ParaView present a steep learning curve that discourages many potential users. This work introduces ParaView-MCP, an autonomous agent that integrates modern multimodal large language models (MLLMs) with ParaView to not only lower the barrier to entry but also augment ParaView with intelligent decision support. By leveraging the state-of-the-art reasoning, command execution, and vision capabilities of MLLMs, ParaView-MCP enables users to interact with ParaView through natural language and visual inputs. Specifically, our system adopted the Model Context Protocol (MCP) - a standardized interface for model-application communication - that facilitates direct interaction between MLLMs with ParaView's Python API to allow seamless information exchange between the user, the language model, and the visualization tool itself. Furthermore, by implementing a visual feedback mechanism that allows the agent to observe the viewport, we unlock a range of new capabilities, including recreating visualizations from examples, closed-loop visualization parameter updates based on user-defined goals, and even cross-application collaboration involving multiple tools. Broadly, we believe such an agent-driven visualization paradigm can profoundly change the way we interact with visualization tools. We expect a significant uptake in the development of such visualization tools, in both visualization research and industry.

摘要

虽然功能强大且技术成熟,但ParaView等工具陡峭的学习曲线阻碍了许多潜在用户的使用。本研究提出ParaView-MCP——一种将现代多模态大语言模型(MLLMs)与ParaView相结合的自主体,该系统不仅能降低使用门槛,还能通过智能决策支持增强ParaView功能。通过利用MLLMs最先进的推理能力、命令执行能力和视觉能力,ParaView-MCP支持用户通过自然语言和视觉输入与ParaView交互。具体而言,我们采用模型上下文协议(MCP)——一种模型与应用通信的标准化接口——使MLLMs能够直接调用ParaView的Python API,实现用户、语言模型与可视化工具间的无缝信息交换。此外,通过实现可观察视口的视觉反馈机制,我们解锁了一系列新功能:包括根据示例重建可视化、基于用户定义目标的闭环可视化参数更新,甚至支持多工具参与的跨应用协作。总体而言,我们认为这种由智能体驱动的可视化范式将深刻改变我们与可视化工具的交互方式。预计此类可视化工具在可视化研究领域和产业界都将得到广泛应用。


Seed1.5-VL Technical Report

Abstract

arXiv:2505.07062v1 Announce Type: cross Abstract: We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving the state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)

摘要

我们推出Seed1.5-VL视觉语言基础模型,旨在推进通用多模态理解与推理能力。该模型由5.32亿参数的视觉编码器和200亿激活参数的专家混合(MoE)大语言模型构成,虽架构相对紧凑,却在广泛公共视觉语言基准及内部评估体系中表现优异,在60项公共基准测试中有38项达到最先进水平。在GUI控制和游戏等以智能体为中心的任务中,Seed1.5-VL表现超越包括OpenAI CUA和Claude 3.7在内的领先多模态系统。除视觉与视频理解外,该模型还展现出强大的推理能力,尤其擅长视觉谜题等多模态推理挑战。我们相信这些能力将赋能更广泛的任务应用。本报告重点从模型设计、数据构建及分阶段训练等方面系统总结Seed1.5-VL的研发经验,以期推动后续研究。该模型现已通过火山引擎平台开放访问(模型ID:doubao-1-5-thinking-vision-pro-250428)。


UMoE: Unifying Attention and FFN with Shared Experts

Abstract

arXiv:2505.07260v1 Announce Type: cross Abstract: Sparse Mixture of Experts (MoE) architectures have emerged as a promising approach for scaling Transformer models. While initial works primarily incorporated MoE into feed-forward network (FFN) layers, recent studies have explored extending the MoE paradigm to attention layers to enhance model performance. However, existing attention-based MoE layers require specialized implementations and demonstrate suboptimal performance compared to their FFN-based counterparts. In this paper, we aim to unify the MoE designs in attention and FFN layers by introducing a novel reformulation of the attention mechanism, revealing an underlying FFN-like structure within attention modules. Our proposed architecture, UMoE, achieves superior performance through attention-based MoE layers while enabling efficient parameter sharing between FFN and attention components.

摘要

稀疏专家混合(MoE)架构已成为扩展Transformer模型的一种有效方法。早期研究主要将MoE应用于前馈网络(FFN)层,而近期工作则尝试将MoE范式扩展到注意力层以提升模型性能。然而,现有基于注意力的MoE层需要专门实现,且性能表现逊于基于FFN的对应结构。本文通过提出注意力机制的新颖重构方案,揭示了注意力模块中潜在的类FFN结构,旨在统一注意力层与FFN层的MoE设计。我们提出的UMoE架构通过基于注意力的MoE层实现了卓越性能,同时支持FFN与注意力组件间的高效参数共享。


DynamicRAG: Leveraging Outputs of Large Language Model as Feedback for Dynamic Reranking in Retrieval-Augmented Generation

Abstract

arXiv:2505.07233v1 Announce Type: cross Abstract: Retrieval-augmented generation (RAG) systems combine large language models (LLMs) with external knowledge retrieval, making them highly effective for knowledge-intensive tasks. A crucial but often under-explored component of these systems is the reranker, which refines retrieved documents to enhance generation quality and explainability. The challenge of selecting the optimal number of documents (k) remains unsolved: too few may omit critical information, while too many introduce noise and inefficiencies. Although recent studies have explored LLM-based rerankers, they primarily leverage internal model knowledge and overlook the rich supervisory signals that LLMs can provide, such as using response quality as feedback for optimizing reranking decisions. In this paper, we propose DynamicRAG, a novel RAG framework where the reranker dynamically adjusts both the order and number of retrieved documents based on the query. We model the reranker as an agent optimized through reinforcement learning (RL), using rewards derived from LLM output quality. Across seven knowledge-intensive datasets, DynamicRAG demonstrates superior performance, achieving state-of-the-art results. The model, data and code are available at https://github.com/GasolSun36/DynamicRAG

摘要

检索增强生成(RAG)系统通过将大语言模型(LLM)与外部知识检索相结合,在知识密集型任务中展现出卓越效能。这些系统中一个关键但常被忽视的组件是重排序器,其通过优化检索文档提升生成质量与可解释性。如何选择最优文档数量(k)仍是未解难题:过少可能遗漏关键信息,过多则引入噪声并降低效率。尽管近期研究探索了基于LLM的重排序器,但它们主要利用模型内部知识,忽视了LLM可提供的丰富监督信号(例如将响应质量作为优化重排序决策的反馈)。本文提出DynamicRAG——一种新型RAG框架,其重排序器能根据查询动态调整检索文档的顺序和数量。我们将重排序器建模为通过强化学习(RL)优化的智能体,其奖励函数源自LLM输出质量。在七个知识密集型数据集上的实验表明,DynamicRAG性能优越,达到了最先进水平。模型、数据及代码已开源:https://github.com/GasolSun36/DynamicRAG


Comet: Accelerating Private Inference for Large Language Model by Predicting Activation Sparsity

Abstract

arXiv:2505.07239v1 Announce Type: cross Abstract: With the growing use of large language models (LLMs) hosted on cloud platforms to offer inference services, privacy concerns about the potential leakage of sensitive information are escalating. Secure multi-party computation (MPC) is a promising solution to protect the privacy in LLM inference. However, MPC requires frequent inter-server communication, causing high performance overhead. Inspired by the prevalent activation sparsity of LLMs, where most neuron are not activated after non-linear activation functions, we propose an efficient private inference system, Comet. This system employs an accurate and fast predictor to predict the sparsity distribution of activation function output. Additionally, we introduce a new private inference protocol. It efficiently and securely avoids computations involving zero values by exploiting the spatial locality of the predicted sparse distribution. While this computation-avoidance approach impacts the spatiotemporal continuity of KV cache entries, we address this challenge with a low-communication overhead cache refilling strategy that merges miss requests and incorporates a prefetching mechanism. Finally, we evaluate Comet on four common LLMs and compare it with six state-of-the-art private inference systems. Comet achieves a 1.87x-2.63x speedup and a 1.94x-2.64x communication reduction.

摘要

随着基于云平台的大型语言模型(LLMs)推理服务日益普及,关于敏感信息潜在泄露的隐私担忧不断加剧。安全多方计算(MPC)是保护LLM推理隐私的有效方案,但其频繁的服务器间通信会导致高昂性能开销。受LLMs普遍存在的激活稀疏性启发(即大多数神经元经过非线性激活函数后未被激活),我们提出高效隐私推理系统Comet。该系统采用精确快速的预测器来预判激活函数输出的稀疏分布,并创新性地提出新型隐私推理协议:通过利用预测稀疏分布的空间局部性,安全高效地规避零值相关计算。虽然这种计算规避策略会影响KV缓存项的时空连续性,但我们通过合并缺失请求与预取机制的低通信开销缓存重填策略解决了该问题。最终在四种常见LLMs上的实验表明,相比六种最先进的隐私推理系统,Comet实现了1.87-2.63倍加速和1.94-2.64倍通信缩减。


UAV-CodeAgents: Scalable UAV Mission Planning via Multi-Agent ReAct and Vision-Language Reasoning

Abstract

arXiv:2505.07236v1 Announce Type: cross Abstract: We present UAV-CodeAgents, a scalable multi-agent framework for autonomous UAV mission generation, built on large language and vision-language models (LLMs/VLMs). The system leverages the ReAct (Reason + Act) paradigm to interpret satellite imagery, ground high-level natural language instructions, and collaboratively generate UAV trajectories with minimal human supervision. A core component is a vision-grounded, pixel-pointing mechanism that enables precise localization of semantic targets on aerial maps. To support real-time adaptability, we introduce a reactive thinking loop, allowing agents to iteratively reflect on observations, revise mission goals, and coordinate dynamically in evolving environments. UAV-CodeAgents is evaluated on large-scale mission scenarios involving industrial and environmental fire detection. Our results show that a lower decoding temperature (0.5) yields higher planning reliability and reduced execution time, with an average mission creation time of 96.96 seconds and a success rate of 93%. We further fine-tune Qwen2.5VL-7B on 9,000 annotated satellite images, achieving strong spatial grounding across diverse visual categories. To foster reproducibility and future research, we will release the full codebase and a novel benchmark dataset for vision-language-based UAV planning.

摘要

我们提出UAV-CodeAgents——一个基于大语言模型与视觉语言模型(LLMs/VLMs)构建的、可扩展的多智能体自主无人机任务生成框架。该系统采用"推理+执行"(ReAct)范式,通过解析卫星图像与地面高层级自然语言指令,在最小化人工干预的情况下协同生成无人机航迹。其核心组件是具备视觉 grounding 能力的像素级指向机制,可实现航空地图上语义目标的精确定位。为支持实时适应性,我们引入了反应式思维循环机制,使智能体能够在动态环境中迭代反思观测数据、修正任务目标并进行协同决策。该框架在工业与环境火灾检测的大规模任务场景中完成评估,结果表明较低的解码温度(0.5)可带来更高规划可靠性(任务成功率93%)与更短执行时间(平均任务生成耗时96.96秒)。我们进一步基于9,000张标注卫星图像对Qwen2.5VL-7B模型进行微调,在多样化视觉类别中实现了强空间 grounding 能力。为促进可复现性与后续研究,我们将公开完整代码库及面向视觉语言无人机规划的新型基准数据集。


No Query, No Access

Abstract

arXiv:2505.07258v1 Announce Type: cross Abstract: Textual adversarial attacks mislead NLP models, including Large Language Models (LLMs), by subtly modifying text. While effective, existing attacks often require knowledge of the victim model, extensive queries, or access to training data, limiting real-world feasibility. To overcome these constraints, we introduce the \textbf{Victim Data-based Adversarial Attack (VDBA)}, which operates using only victim texts. To prevent access to the victim model, we create a shadow dataset with publicly available pre-trained models and clustering methods as a foundation for developing substitute models. To address the low attack success rate (ASR) due to insufficient information feedback, we propose the hierarchical substitution model design, generating substitute models to mitigate the failure of a single substitute model at the decision boundary. Concurrently, we use diverse adversarial example generation, employing various attack methods to generate and select the adversarial example with better similarity and attack effectiveness. Experiments on the Emotion and SST5 datasets show that VDBA outperforms state-of-the-art methods, achieving an ASR improvement of 52.08% while significantly reducing attack queries to 0. More importantly, we discover that VDBA poses a significant threat to LLMs such as Qwen2 and the GPT family, and achieves the highest ASR of 45.99% even without access to the API, confirming that advanced NLP models still face serious security risks. Our codes can be found at https://anonymous.4open.science/r/VDBA-Victim-Data-based-Adversarial-Attack-36EC/

摘要

文本对抗攻击通过细微修改文本误导包括大语言模型(LLM)在内的NLP模型。尽管现有攻击方法有效,但它们通常需要了解受害者模型、大量查询或访问训练数据,限制了实际可行性。为突破这些限制,我们提出基于受害者数据的对抗攻击(VDBA),该方法仅需利用受害者文本即可实施攻击。为防止访问受害者模型,我们使用公开预训练模型和聚类方法构建影子数据集,作为开发替代模型的基础。针对信息反馈不足导致的低攻击成功率(ASR),我们提出分层替代模型设计,通过生成多个替代模型来缓解单一模型在决策边界失效的问题。同时采用多样化对抗样本生成策略,综合多种攻击方法生成并筛选相似性与攻击效果更优的对抗样本。在Emotion和SST5数据集上的实验表明,VDBA以52.08%的ASR提升超越现有最优方法,且将攻击查询量降为0。更重要的是,我们发现VDBA对Qwen2和GPT系列等大语言模型构成显著威胁,在无法访问API的情况下仍实现45.99%的最高ASR,证实先进NLP模型仍面临严重安全风险。


Semantic Retention and Extreme Compression in LLMs: Can We Have Both?

Abstract

arXiv:2505.07289v1 Announce Type: cross Abstract: The exponential growth in Large Language Model (LLM) deployment has intensified the need for efficient model compression techniques to reduce computational and memory costs. While pruning and quantization have shown promise, their combined potential remains largely unexplored. In this paper, we examine joint compression and how strategically combining pruning and quantization could yield superior performance-to-compression ratios compared to single-method approaches. Recognizing the challenges in accurately assessing LLM performance, we address key limitations of previous evaluation frameworks and introduce the Semantic Retention Compression Rate (SrCr), a novel metric that quantifies the trade-off between model compression and semantic preservation, facilitating the optimization of pruning-quantization configurations. Experiments demonstrate that our recommended combination achieves, on average, a 20% performance increase compared to an equivalent quantization-only model at the same theoretical compression rate.

摘要

大型语言模型(LLM)部署的指数级增长加剧了对高效模型压缩技术的需求,以降低计算和内存成本。尽管剪枝和量化已展现出潜力,但二者的联合应用潜力仍 largely 未被探索。本文研究了联合压缩策略,论证了通过有机结合剪枝与量化,相较于单一方法可获得更优的性能-压缩比。针对LLM性能评估的固有挑战,我们指出了现有评估框架的关键局限,并提出"语义保持压缩率"(SrCr)这一创新指标——该指标量化了模型压缩与语义保留之间的权衡关系,为剪枝-量化配置的优化提供了依据。实验表明,在相同理论压缩率下,我们推荐的组合方案平均比纯量化模型性能提升20%。


SAS-Bench: A Fine-Grained Benchmark for Evaluating Short Answer Scoring with Large Language Models

Abstract

arXiv:2505.07247v1 Announce Type: cross Abstract: Subjective Answer Grading (SAG) plays a crucial role in education, standardized testing, and automated assessment systems, particularly for evaluating short-form responses in Short Answer Scoring (SAS). However, existing approaches often produce coarse-grained scores and lack detailed reasoning. Although large language models (LLMs) have demonstrated potential as zero-shot evaluators, they remain susceptible to bias, inconsistencies with human judgment, and limited transparency in scoring decisions. To overcome these limitations, we introduce SAS-Bench, a benchmark specifically designed for LLM-based SAS tasks. SAS-Bench provides fine-grained, step-wise scoring, expert-annotated error categories, and a diverse range of question types derived from real-world subject-specific exams. This benchmark facilitates detailed evaluation of model reasoning processes and explainability. We also release an open-source dataset containing 1,030 questions and 4,109 student responses, each annotated by domain experts. Furthermore, we conduct comprehensive experiments with various LLMs, identifying major challenges in scoring science-related questions and highlighting the effectiveness of few-shot prompting in improving scoring accuracy. Our work offers valuable insights into the development of more robust, fair, and educationally meaningful LLM-based evaluation systems.

摘要

主观题评分(SAG)在教育、标准化考试和自动化评估系统中具有关键作用,尤其适用于短答案评分(SAS)中对简答式作答的评估。然而现有方法通常仅生成粗粒度分数且缺乏详细推理过程。尽管大语言模型(LLMs)已展现出作为零样本评估者的潜力,但其仍易受偏见影响、与人类判断存在不一致性,且评分决策透明度有限。为克服这些局限,我们提出SAS-Bench——专为基于LLM的SAS任务设计的基准测试。该基准提供细粒度的分步评分、专家标注的错误类型分类,以及源自真实学科考试的多样化题型,可支持对模型推理过程与可解释性的深度评估。我们同时开源包含1,030道试题和4,109份学生作答的数据集,所有数据均经由领域专家标注。通过多种LLM的全面实验,我们揭示了科学类试题评分的主要挑战,并证明小样本提示能有效提升评分准确性。本研究为开发更稳健、公平且具有教育意义的LLM评估系统提供了重要见解。


Towards Multi-Agent Reasoning Systems for Collaborative Expertise Delegation: An Exploratory Design Study

Abstract

arXiv:2505.07313v1 Announce Type: cross Abstract: Designing effective collaboration structure for multi-agent LLM systems to enhance collective reasoning is crucial yet remains under-explored. In this paper, we systematically investigate how collaborative reasoning performance is affected by three key design dimensions: (1) Expertise-Domain Alignment, (2) Collaboration Paradigm (structured workflow vs. diversity-driven integration), and (3) System Scale. Our findings reveal that expertise alignment benefits are highly domain-contingent, proving most effective for contextual reasoning tasks. Furthermore, collaboration focused on integrating diverse knowledge consistently outperforms rigid task decomposition. Finally, we empirically explore the impact of scaling the multi-agent system with expertise specialization and study the computational trade off, highlighting the need for more efficient communication protocol design. This work provides concrete guidelines for configuring specialized multi-agent system and identifies critical architectural trade-offs and bottlenecks for scalable multi-agent reasoning. The code will be made available upon acceptance.

摘要

设计有效的多智能体大语言模型系统协作结构以增强集体推理能力至关重要,但目前仍未得到充分探索。本文系统研究了协作推理性能受三个关键设计维度的影响:(1)专业领域对齐,(2)协作范式(结构化工作流与多样性驱动整合),以及(3)系统规模。研究发现专业对齐的效益具有高度领域依赖性,在上下文推理任务中效果最为显著。此外,注重整合多元知识的协作方式始终优于刚性任务分解。最后,我们通过实证研究了具有专业特化的多智能体系统规模扩展影响及计算权衡,强调需要设计更高效的通信协议。本研究为配置专业化多智能体系统提供了具体指导,并指出了可扩展多智能体推理的关键架构权衡与瓶颈。代码将在论文录用后公开。


INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning

Abstract

arXiv:2505.07291v1 Announce Type: cross Abstract: We introduce INTELLECT-2, the first globally distributed reinforcement learning (RL) training run of a 32 billion parameter language model. Unlike traditional centralized training efforts, INTELLECT-2 trains a reasoning model using fully asynchronous RL across a dynamic, heterogeneous swarm of permissionless compute contributors. To enable a training run with this unique infrastructure, we built various components from scratch: we introduce PRIME-RL, our training framework purpose-built for distributed asynchronous reinforcement learning, based on top of novel components such as TOPLOC, which verifies rollouts from untrusted inference workers, and SHARDCAST, which efficiently broadcasts policy weights from training nodes to inference workers. Beyond infrastructure components, we propose modifications to the standard GRPO training recipe and data filtering techniques that were crucial to achieve training stability and ensure that our model successfully learned its training objective, thus improving upon QwQ-32B, the state of the art reasoning model in the 32B parameter range. We open-source INTELLECT-2 along with all of our code and data, hoping to encourage and enable more open research in the field of decentralized training.

摘要

我们介绍了INTELLECT-2,这是首个在全球范围内分布式进行的320亿参数语言模型强化学习(RL)训练项目。与传统集中式训练不同,INTELLECT-2通过完全异步的强化学习方式,在一个动态、异构且无需许可的计算贡献者集群上训练推理模型。为实现这种独特基础设施下的训练,我们从零开始构建了多个核心组件:提出了专为分布式异步强化学习设计的训练框架PRIME-RL,其底层包含创新性组件——用于验证不可信推理工作节点计算结果的TOPLOC,以及高效将策略权重从训练节点广播至推理工作节点的SHARDCAST。除基础设施组件外,我们对标准GRPO训练方案和数据过滤技术进行了改进,这些改进对保持训练稳定性、确保模型成功学习训练目标至关重要,从而使我们的模型超越了当前320亿参数范围内最先进的推理模型QwQ-32B。我们将INTELLECT-2及其全部代码和数据开源,以期推动并赋能分布式训练领域更开放的学术研究。


QUPID: Quantified Understanding for Enhanced Performance, Insights, and Decisions in Korean Search Engines

Abstract

arXiv:2505.07345v1 Announce Type: cross Abstract: Large language models (LLMs) have been widely used for relevance assessment in information retrieval. However, our study demonstrates that combining two distinct small language models (SLMs) with different architectures can outperform LLMs in this task. Our approach -- QUPID -- integrates a generative SLM with an embedding-based SLM, achieving higher relevance judgment accuracy while reducing computational costs compared to state-of-the-art LLM solutions. This computational efficiency makes QUPID highly scalable for real-world search systems processing millions of queries daily. In experiments across diverse document types, our method demonstrated consistent performance improvements (Cohen's Kappa of 0.646 versus 0.387 for leading LLMs) while offering 60x faster inference times. Furthermore, when integrated into production search pipelines, QUPID improved nDCG@5 scores by 1.9%. These findings underscore how architectural diversity in model combinations can significantly enhance both search relevance and operational efficiency in information retrieval systems.

摘要

大型语言模型(LLMs)在信息检索的相关性评估中已被广泛应用。然而,我们的研究表明,结合两种不同架构的小型语言模型(SLMs)在此任务上能够超越LLMs的表现。我们提出的QUPID方法将生成式SLM与基于嵌入的SLM相融合,相较于最先进的LLM解决方案,在提升相关性判断准确率的同时降低了计算成本。这种计算效率使得QUPID在处理每日数百万查询的实际搜索系统中具有高度可扩展性。在多种文档类型的实验中,该方法展现出稳定的性能提升(Cohen's Kappa系数达0.646,而主流LLMs为0.387),且推理速度加快60倍。当部署至生产环境搜索管道时,QUPID使nDCG@5分数提高了1.9%。这些发现印证了模型组合的架构多样性如何显著提升信息检索系统的搜索相关性和运行效率。


AI in Money Matters

Abstract

arXiv:2505.07393v1 Announce Type: cross Abstract: In November 2022, Europe and the world by and large were stunned by the birth of a new large language model : ChatGPT. Ever since then, both academic and populist discussions have taken place in various public spheres such as LinkedIn and X(formerly known as Twitter) with the view to both understand the tool and its benefits for the society. The views of real actors in professional spaces, especially in regulated industries such as finance and law have been largely missing. We aim to begin to close this gap by presenting results from an empirical investigation conducted through interviews with professional actors in the Fintech industry. The paper asks the question, how and to what extent are large language models in general and ChatGPT in particular being adopted and used in the Fintech industry? The results show that while the fintech experts we spoke with see a potential in using large language models in the future, a lot of questions marks remain concerning how they are policed and therefore might be adopted in a regulated industry such as Fintech. This paper aims to add to the existing academic discussing around large language models, with a contribution to our understanding of professional viewpoints.

摘要

2022年11月,新一代大型语言模型ChatGPT的诞生令欧洲乃至全球为之震惊。此后,学术界和大众领域在领英、X(原推特)等各类公共平台展开讨论,旨在理解该工具及其社会效益。然而,专业领域从业者——尤其是金融、法律等受监管行业的真实观点长期缺位。本研究通过访谈金融科技行业专业人士的实证调查,试图填补这一空白。本文核心研究问题是:金融科技行业对大型语言模型(特别是ChatGPT)的采纳程度及使用方式如何?研究结果表明,尽管受访的金融科技专家认为未来大型语言模型具有应用潜力,但在监管框架如何约束这类技术、进而影响其在金融科技等受监管行业的应用方面,仍存在诸多未解之谜。本文旨在拓展现有关于大型语言模型的学术讨论,通过呈现专业视角深化学界认知。


Examining the Role of LLM-Driven Interactions on Attention and Cognitive Engagement in Virtual Classrooms

Abstract

arXiv:2505.07377v1 Announce Type: cross Abstract: Transforming educational technologies through the integration of large language models (LLMs) and virtual reality (VR) offers the potential for immersive and interactive learning experiences. However, the effects of LLMs on user engagement and attention in educational environments remain open questions. In this study, we utilized a fully LLM-driven virtual learning environment, where peers and teachers were LLM-driven, to examine how students behaved in such settings. Specifically, we investigate how peer question-asking behaviors influenced student engagement, attention, cognitive load, and learning outcomes and found that, in conditions where LLM-driven peer learners asked questions, students exhibited more targeted visual scanpaths, with their attention directed toward the learning content, particularly in complex subjects. Our results suggest that peer questions did not introduce extraneous cognitive load directly, as the cognitive load is strongly correlated with increased attention to the learning material. Considering these findings, we provide design recommendations for optimizing VR learning spaces.

摘要

通过整合大语言模型(LLMs)与虚拟现实(VR)技术革新教育科技,有望实现沉浸式互动学习体验。然而,LLMs在教育环境中对用户参与度与注意力影响的研究尚存空白。本研究构建了完全由LLM驱动的虚拟学习环境(其中同伴与教师均为LLM代理),以探究学生在此类情境中的行为模式。具体而言,我们分析了同伴提问行为如何影响学生的参与度、注意力、认知负荷及学习成效,结果发现:当LLM驱动的同伴学习者发起提问时,学生会表现出更具针对性的视觉扫描路径,其注意力更集中于学习内容——尤其在复杂学科领域。研究表明,同伴提问并未直接引入额外认知负荷,因为认知负荷的增强与学习材料注意力的提升呈显著正相关。基于这些发现,我们提出了优化VR学习空间的设计建议。


Synthetic Code Surgery: Repairing Bugs and Vulnerabilities with LLMs and Synthetic Data

Abstract

arXiv:2505.07372v1 Announce Type: cross Abstract: This paper presents a novel methodology for enhancing Automated Program Repair (APR) through synthetic data generation utilizing Large Language Models (LLMs). Current APR systems are constrained by the limited availability of high-quality training data encompassing diverse bug types across multiple programming languages. The proposed approach addresses this limitation through a two-phase process: a synthetic sample generation followed by a rigorous quality assessment. Multiple state-of-the-art LLMs were employed to generate approximately 30,000 paired examples of buggy and fixed code across 12 programming languages and 13 bug categories. Subsequently, these samples underwent cross-model evaluation against five criteria: correctness, code quality, security, performance, and completeness. Experimental evaluation on the VulRepair test set dataset showed statistically significant improvements in Perfect Prediction rates, with the quality-filtered synthetic dataset outperforming both baseline and real-world commit data configurations in certain scenarios. The methodology was validated through rigorous statistical testing, including ANOVA and post-hoc Tukey's Honest Significant Difference analysis. Furthermore, the best-performing configurations surpassed existing systems despite using a less computationally intensive decoding strategy. This research establishes a self-bootstrapping paradigm in which LLMs generate and evaluate their own training data, potentially transforming approaches to data scarcity across software engineering tasks and advancing the development of robust, adaptable tools for automated code maintenance.

摘要

本文提出了一种利用大型语言模型(LLMs)通过合成数据生成增强自动化程序修复(APR)的新方法。当前APR系统受限于跨多种编程语言的多样化缺陷类型高质量训练数据的有限可用性。该研究通过两阶段流程解决这一局限:首先生成合成样本,随后进行严格质量评估。研究采用多种前沿LLMs生成了约30,000对跨12种编程语言和13种缺陷类别的缺陷代码与修复代码示例。这些样本随后接受了基于五项标准的跨模型评估:正确性、代码质量、安全性、性能和完整性。在VulRepair测试集上的实验评估显示,经过质量筛选的合成数据集在完美预测率方面取得统计显著性提升,在某些场景下优于基线和真实提交数据配置。该方法通过严格的统计检验得到验证,包括方差分析和事后Tukey真实显著性差异分析。此外,最佳性能配置在采用较低计算强度的解码策略情况下仍超越现有系统。本研究建立了一种自举范式,即LLMs生成并评估自身训练数据,有望革新软件工程任务中数据稀缺的应对方法,推动开发更健壮、适应性更强的自动化代码维护工具。


Can Generative AI agents behave like humans? Evidence from laboratory market experiments

Abstract

arXiv:2505.07457v1 Announce Type: cross Abstract: We explore the potential of Large Language Models (LLMs) to replicate human behavior in economic market experiments. Compared to previous studies, we focus on dynamic feedback between LLM agents: the decisions of each LLM impact the market price at the current step, and so affect the decisions of the other LLMs at the next step. We compare LLM behavior to market dynamics observed in laboratory settings and assess their alignment with human participants' behavior. Our findings indicate that LLMs do not adhere strictly to rational expectations, displaying instead bounded rationality, similarly to human participants. Providing a minimal context window i.e. memory of three previous time steps, combined with a high variability setting capturing response heterogeneity, allows LLMs to replicate broad trends seen in human experiments, such as the distinction between positive and negative feedback markets. However, differences remain at a granular level--LLMs exhibit less heterogeneity in behavior than humans. These results suggest that LLMs hold promise as tools for simulating realistic human behavior in economic contexts, though further research is needed to refine their accuracy and increase behavioral diversity.

摘要

我们探讨了大型语言模型(LLMs)在经济学市场实验中复现人类行为的潜力。与先前研究相比,我们重点关注LLM智能体之间的动态反馈机制:每个LLM的决策会影响当前阶段的市场价格,从而进一步影响其他LLM在下一阶段的决策。通过将LLM行为与实验室环境下观察到的市场动态进行对比,我们评估了其与人类参与者行为的一致性。研究发现,LLMs并不严格遵循理性预期原则,而是与人类参与者类似表现出有限理性特征。当提供包含前三阶段记忆的最小上下文窗口,并结合反映响应异质性的高可变性设置时,LLMs能够复现人类实验中的宏观趋势,例如正反馈与负反馈市场的区分特征。然而在微观层面仍存在差异——LLMs表现出的行为异质性低于人类。这些结果表明,LLMs有望成为经济情境中模拟真实人类行为的有效工具,但需进一步研究以提高其准确性并增强行为多样性。


ToolACE-DEV: Self-Improving Tool Learning via Decomposition and EVolution

Abstract

arXiv:2505.07512v1 Announce Type: cross Abstract: The tool-using capability of large language models (LLMs) enables them to access up-to-date external information and handle complex tasks. Current approaches to enhancing this capability primarily rely on distilling advanced models by data synthesis. However, this method incurs significant costs associated with advanced model usage and often results in data compatibility issues, led by the high discrepancy in the knowledge scope between the advanced model and the target model. To address these challenges, we propose ToolACE-DEV, a self-improving framework for tool learning. First, we decompose the tool-learning objective into sub-tasks that enhance basic tool-making and tool-using abilities. Then, we introduce a self-evolving paradigm that allows lightweight models to self-improve, reducing reliance on advanced LLMs. Extensive experiments validate the effectiveness of our approach across models of varying scales and architectures.

摘要

大语言模型(LLMs)的工具使用能力使其能够获取最新外部信息并处理复杂任务。当前增强该能力的方法主要依赖于通过数据合成来蒸馏高级模型。然而,这种方法不仅会带来高级模型使用的高昂成本,还常因目标模型与高级模型在知识范围上的显著差异而导致数据兼容性问题。为解决这些挑战,我们提出了ToolACE-DEV——一种工具学习的自我提升框架。首先,我们将工具学习目标分解为提升基础工具制造与工具使用能力的子任务;其次,引入自进化范式,使轻量级模型能够自我改进,从而降低对高级LLMs的依赖。大量实验验证了该方法在不同规模和架构模型上的有效性。


LEAD: Iterative Data Selection for Efficient LLM Instruction Tuning

Abstract

arXiv:2505.07437v1 Announce Type: cross Abstract: Instruction tuning has emerged as a critical paradigm for improving the capabilities and alignment of large language models (LLMs). However, existing iterative model-aware data selection methods incur significant computational overhead, as they rely on repeatedly performing full-dataset model inference to estimate sample utility for subsequent training iterations, creating a fundamental efficiency bottleneck. In this paper, we propose LEAD, an efficient iterative data selection framework that accurately estimates sample utility entirely within the standard training loop, eliminating the need for costly additional model inference. At its core, LEAD introduces Instance-Level Dynamic Uncertainty (IDU), a theoretically grounded utility function combining instantaneous training loss, gradient-based approximation of loss changes, and exponential smoothing of historical loss signals. To further scale efficiently to large datasets, LEAD employs a two-stage, coarse-to-fine selection strategy, adaptively prioritizing informative clusters through a multi-armed bandit mechanism, followed by precise fine-grained selection of high-utility samples using IDU. Extensive experiments across four diverse benchmarks show that LEAD significantly outperforms state-of-the-art methods, improving average model performance by 6.1%-10.8% while using only 2.5% of the training data and reducing overall training time by 5-10x.

摘要

指令微调已成为提升大语言模型(LLM)能力与对齐性的关键范式。然而,现有基于模型感知的迭代数据选择方法需依赖重复的全数据集模型推理来评估样本效用,导致计算开销巨大,形成根本性的效率瓶颈。本文提出LEAD框架——一种高效的迭代数据选择方法,其创新性在于完全在标准训练循环内精准估计样本效用,无需额外模型推理成本。该框架核心是提出理论完备的实例级动态不确定性(IDU)效用函数,融合瞬时训练损失、基于梯度的损失变化近似及历史损失信号的指数平滑。为高效扩展至大规模数据集,LEAD采用两阶段由粗到精的选择策略:通过多臂老虎机机制自适应优先选择信息量大的聚类簇,继而使用IDU精准筛选高效用样本。在四个多样化基准测试上的实验表明,LEAD显著优于现有最优方法,仅需2.5%训练数据即可将模型平均性能提升6.1%-10.8%,同时降低5-10倍总体训练时间。


GRADA: Graph-based Reranker against Adversarial Documents Attack

Abstract

arXiv:2505.07546v1 Announce Type: cross Abstract: Retrieval Augmented Generation (RAG) frameworks improve the accuracy of large language models (LLMs) by integrating external knowledge from retrieved documents, thereby overcoming the limitations of models' static intrinsic knowledge. However, these systems are susceptible to adversarial attacks that manipulate the retrieval process by introducing documents that are adversarial yet semantically similar to the query. Notably, while these adversarial documents resemble the query, they exhibit weak similarity to benign documents in the retrieval set. Thus, we propose a simple yet effective Graph-based Reranking against Adversarial Document Attacks (GRADA) framework aiming at preserving retrieval quality while significantly reducing the success of adversaries. Our study evaluates the effectiveness of our approach through experiments conducted on five LLMs: GPT-3.5-Turbo, GPT-4o, Llama3.1-8b, Llama3.1-70b, and Qwen2.5-7b. We use three datasets to assess performance, with results from the Natural Questions dataset demonstrating up to an 80% reduction in attack success rates while maintaining minimal loss in accuracy.

摘要

检索增强生成(RAG)框架通过整合检索文档的外部知识,提升了大型语言模型(LLM)的准确性,从而克服了模型静态内在知识的局限性。然而,此类系统易受到对抗性攻击的影响,攻击者通过引入与查询语义相似但具有对抗性的文档来操纵检索过程。值得注意的是,尽管这些对抗性文档与查询相似,但其与检索集中良性文档的相似性较弱。因此,我们提出了一种简单而有效的基于图的重排序框架(GRADA),旨在保持检索质量的同时显著降低对抗攻击的成功率。本研究通过在五种LLM(GPT-3.5-Turbo、GPT-4o、Llama3.1-8b、Llama3.1-70b和Qwen2.5-7b)上进行的实验评估了该方法的有效性。我们使用三个数据集评估性能,其中Natural Questions数据集的结果表明,在保持精度损失最小的情况下,攻击成功率最高可降低80%。


Towards Requirements Engineering for RAG Systems

Abstract

arXiv:2505.07553v1 Announce Type: cross Abstract: This short paper explores how a maritime company develops and integrates large-language models (LLM). Specifically by looking at the requirements engineering for Retrieval Augmented Generation (RAG) systems in expert settings. Through a case study at a maritime service provider, we demonstrate how data scientists face a fundamental tension between user expectations of AI perfection and the correctness of the generated outputs. Our findings reveal that data scientists must identify context-specific "retrieval requirements" through iterative experimentation together with users because they are the ones who can determine correctness. We present an empirical process model describing how data scientists practically elicited these "retrieval requirements" and managed system limitations. This work advances software engineering knowledge by providing insights into the specialized requirements engineering processes for implementing RAG systems in complex domain-specific applications.

摘要

这篇短文探讨了一家海事公司如何开发并整合大语言模型(LLM),具体通过研究专家场景下检索增强生成(RAG)系统的需求工程来实现。通过对某海事服务提供商的案例研究,我们揭示了数据科学家在用户对人工智能完美表现的期望与生成输出正确性之间所面临的根本矛盾。研究发现表明,数据科学家必须通过与用户的迭代实验来确定特定情境下的"检索需求",因为只有用户能够判定正确性。我们提出了一个实证流程模型,描述数据科学家如何实际获取这些"检索需求"并管理系统局限性。这项工作通过为复杂领域特定应用中实施RAG系统的专业化需求工程流程提供见解,从而推进了软件工程知识的发展。


A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models

Abstract

arXiv:2505.07591v1 Announce Type: cross Abstract: Instruction following evaluates large language models (LLMs) on their ability to generate outputs that adhere to user-defined constraints. However, existing benchmarks often rely on templated constraint prompts, which lack the diversity of real-world usage and limit fine-grained performance assessment. To fill this gap, we propose a multi-dimensional constraint framework encompassing three constraint patterns, four constraint categories, and four difficulty levels. Building on this framework, we develop an automated instruction generation pipeline that performs constraint expansion, conflict detection, and instruction rewriting, yielding 1,200 code-verifiable instruction-following test samples. We evaluate 19 LLMs across seven model families and uncover substantial variation in performance across constraint forms. For instance, average performance drops from 77.67% at Level I to 32.96% at Level IV. Furthermore, we demonstrate the utility of our approach by using it to generate data for reinforcement learning, achieving substantial gains in instruction following without degrading general performance. In-depth analysis indicates that these gains stem primarily from modifications in the model's attention modules parameters, which enhance constraint recognition and adherence. Code and data are available in https://github.com/Junjie-Ye/MulDimIF.

摘要

指令遵循能力评估旨在检验大语言模型(LLM)生成符合用户定义约束输出的能力。然而现有基准测试多采用模板化约束提示,缺乏真实场景的多样性且难以进行细粒度性能评估。为此,我们提出一个多维约束框架,包含三种约束模式、四类约束范畴和四个难度等级。基于该框架,我们开发了自动化指令生成流程,通过约束扩展、冲突检测和指令重写,构建了1,200个可代码验证的指令遵循测试样本。我们对7个模型家族的19个LLM进行评估,发现不同约束形式的性能存在显著差异(例如平均性能从难度I级的77.67%降至IV级的32.96%)。进一步地,我们通过生成强化学习训练数据验证了本方法的实用性,在保持通用性能的同时显著提升了指令遵循能力。深入分析表明,这种提升主要源于模型注意力模块参数的调整,从而增强了约束识别与遵循能力。代码与数据详见https://github.com/Junjie-Ye/MulDimIF。


Reinforced Internal-External Knowledge Synergistic Reasoning for Efficient Adaptive Search Agent

Abstract

arXiv:2505.07596v1 Announce Type: cross Abstract: Retrieval-augmented generation (RAG) is a common strategy to reduce hallucinations in Large Language Models (LLMs). While reinforcement learning (RL) can enable LLMs to act as search agents by activating retrieval capabilities, existing ones often underutilize their internal knowledge. This can lead to redundant retrievals, potential harmful knowledge conflicts, and increased inference latency. To address these limitations, an efficient and adaptive search agent capable of discerning optimal retrieval timing and synergistically integrating parametric (internal) and retrieved (external) knowledge is in urgent need. This paper introduces the Reinforced Internal-External Knowledge Synergistic Reasoning Agent (IKEA), which could indentify its own knowledge boundary and prioritize the utilization of internal knowledge, resorting to external search only when internal knowledge is deemed insufficient. This is achieved using a novel knowledge-boundary aware reward function and a knowledge-boundary aware training dataset. These are designed for internal-external knowledge synergy oriented RL, incentivizing the model to deliver accurate answers, minimize unnecessary retrievals, and encourage appropriate external searches when its own knowledge is lacking. Evaluations across multiple knowledge reasoning tasks demonstrate that IKEA significantly outperforms baseline methods, reduces retrieval frequency significantly, and exhibits robust generalization capabilities.

摘要

检索增强生成(RAG)是减少大语言模型(LLM)幻觉的常见策略。尽管强化学习(RL)能够通过激活检索能力使LLM充当搜索代理,但现有方法往往未能充分利用其内部知识。这可能导致冗余检索、潜在有害的知识冲突以及推理延迟增加。为解决这些局限性,亟需一种高效且自适应的搜索代理,能够判断最佳检索时机并协同整合参数化(内部)知识与检索(外部)知识。本文提出强化内外知识协同推理代理(IKEA),该代理可识别自身知识边界并优先利用内部知识,仅在内部知识不足时启动外部检索。这一机制通过新型知识边界感知奖励函数和知识边界感知训练数据集实现,二者专为面向内外知识协同的强化学习设计,激励模型提供准确答案、最小化不必要检索,并在自身知识欠缺时鼓励适当的外部搜索。在多项知识推理任务上的评估表明,IKEA显著优于基线方法,大幅降低检索频率,并展现出强大的泛化能力。


A Case Study Investigating the Role of Generative AI in Quality Evaluations of Epics in Agile Software Development

Abstract

arXiv:2505.07664v1 Announce Type: cross Abstract: The broad availability of generative AI offers new opportunities to support various work domains, including agile software development. Agile epics are a key artifact for product managers to communicate requirements to stakeholders. However, in practice, they are often poorly defined, leading to churn, delivery delays, and cost overruns. In this industry case study, we investigate opportunities for large language models (LLMs) to evaluate agile epic quality in a global company. Results from a user study with 17 product managers indicate how LLM evaluations could be integrated into their work practices, including perceived values and usage in improving their epics. High levels of satisfaction indicate that agile epics are a new, viable application of AI evaluations. However, our findings also outline challenges, limitations, and adoption barriers that can inform both practitioners and researchers on the integration of such evaluations into future agile work practices.

摘要

生成式人工智能的广泛普及为支持包括敏捷软件开发在内的各个工作领域提供了新机遇。敏捷史诗是产品经理向利益相关者传达需求的关键工件,但在实践中往往定义不完善,导致返工、交付延迟和成本超支。本行业案例研究探讨了在全球化企业中利用大语言模型(LLM)评估敏捷史诗质量的可行性。通过对17名产品经理开展用户研究,结果表明LLM评估可融入其工作实践,包括在改进史诗过程中体现的感知价值与应用场景。高满意度数据显示敏捷史诗是AI评估的新型可行应用领域。然而,研究结果也揭示了挑战、局限性及采用障碍,这些发现可为从业者和研究者未来将此类评估整合至敏捷工作实践提供参考。


MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining

Abstract

arXiv:2505.07608v1 Announce Type: cross Abstract: We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed. During post-training, we curate a dataset of 130K verifiable mathematics and programming problems for reinforcement learning, integrating a test-difficulty-driven code-reward scheme to alleviate sparse-reward issues and employing strategic data resampling to stabilize training. Extensive evaluations show that MiMo-7B-Base possesses exceptional reasoning potential, outperforming even much larger 32B models. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code and general reasoning tasks, surpassing the performance of OpenAI o1-mini. The model checkpoints are available at https://github.com/xiaomimimo/MiMo.

摘要

我们提出MiMo-7B——一个为推理任务而生的大型语言模型,该模型在预训练和训练后阶段均进行了优化。在预训练阶段,我们改进了数据预处理流程,采用三阶段数据混合策略以增强基础模型的推理潜力。MiMo-7B-Base基于25万亿token进行预训练,并引入多token预测目标以提升性能并加速推理速度。在训练后阶段,我们构建了包含13万个可验证数学与编程问题的强化学习数据集,通过测试难度驱动的代码奖励机制缓解稀疏奖励问题,并采用策略性数据重采样以稳定训练。大量评估表明,MiMo-7B-Base具备卓越的推理潜力,其表现甚至优于规模更大的32B模型。经过强化学习调优的最终模型MiMo-7B-RL在数学、代码和通用推理任务上均展现出优越性能,超越了OpenAI o1-mini的表现。模型检查点已发布于https://github.com/xiaomimimo/MiMo。


Characterizing the Investigative Methods of Fictional Detectives with Large Language Models

Abstract

arXiv:2505.07601v1 Announce Type: cross Abstract: Detective fiction, a genre defined by its complex narrative structures and character-driven storytelling, presents unique challenges for computational narratology, a research field focused on integrating literary theory into automated narrative generation. While traditional literary studies have offered deep insights into the methods and archetypes of fictional detectives, these analyses often focus on a limited number of characters and lack the scalability needed for the extraction of unique traits that can be used to guide narrative generation methods. In this paper, we present an AI-driven approach for systematically characterizing the investigative methods of fictional detectives. Our multi-phase workflow explores the capabilities of 15 Large Language Models (LLMs) to extract, synthesize, and validate distinctive investigative traits of fictional detectives. This approach was tested on a diverse set of seven iconic detectives - Hercule Poirot, Sherlock Holmes, William Murdoch, Columbo, Father Brown, Miss Marple, and Auguste Dupin - capturing the distinctive investigative styles that define each character. The identified traits were validated against existing literary analyses and further tested in a reverse identification phase, achieving an overall accuracy of 91.43%, demonstrating the method's effectiveness in capturing the distinctive investigative approaches of each detective. This work contributes to the broader field of computational narratology by providing a scalable framework for character analysis, with potential applications in AI-driven interactive storytelling and automated narrative generation.

摘要

侦探小说作为一种以复杂叙事结构和角色驱动情节为特征的文学类型,为计算叙事学这一致力于将文学理论融入自动化叙事生成的研究领域带来了独特挑战。尽管传统文学研究已对虚构侦探的推理方法与人物原型提供了深刻见解,但这些分析往往局限于少数角色,缺乏可扩展性以提取能指导叙事生成方法的独特特征。本研究提出一种人工智能驱动的系统性方法,用于刻画虚构侦探的调查方法特征。通过多阶段工作流程,我们探索了15种大型语言模型在提取、综合和验证虚构侦探独特调查特质方面的能力。该方法在七位标志性侦探角色(赫尔克里·波洛、夏洛克·福尔摩斯、威廉·默多克、哥伦布、布朗神父、马普尔小姐和奥古斯特·杜宾)构成的多样化数据集上进行测试,成功捕捉了每位角色的典型调查风格。所识别特征既通过现有文学分析验证,又在逆向识别阶段进一步测试,最终达到91.43%的总体准确率,证实了该方法在捕捉侦探独特调查方式方面的有效性。本研究为计算叙事学领域提供了可扩展的角色分析框架,对人工智能驱动的交互式叙事和自动化叙事生成具有潜在应用价值。


Benchmarking Retrieval-Augmented Generation for Chemistry

Abstract

arXiv:2505.07671v1 Announce Type: cross Abstract: Retrieval-augmented generation (RAG) has emerged as a powerful framework for enhancing large language models (LLMs) with external knowledge, particularly in scientific domains that demand specialized and dynamic information. Despite its promise, the application of RAG in the chemistry domain remains underexplored, primarily due to the lack of high-quality, domain-specific corpora and well-curated evaluation benchmarks. In this work, we introduce ChemRAG-Bench, a comprehensive benchmark designed to systematically assess the effectiveness of RAG across a diverse set of chemistry-related tasks. The accompanying chemistry corpus integrates heterogeneous knowledge sources, including scientific literature, the PubChem database, PubMed abstracts, textbooks, and Wikipedia entries. In addition, we present ChemRAG-Toolkit, a modular and extensible RAG toolkit that supports five retrieval algorithms and eight LLMs. Using ChemRAG-Toolkit, we demonstrate that RAG yields a substantial performance gain -- achieving an average relative improvement of 17.4% over direct inference methods. We further conduct in-depth analyses on retriever architectures, corpus selection, and the number of retrieved passages, culminating in practical recommendations to guide future research and deployment of RAG systems in the chemistry domain. The code and data is available at https://chemrag.github.io.

摘要

检索增强生成(RAG)作为一种增强大语言模型(LLM)外部知识能力的强大框架,在需要专业化动态信息的科学领域展现出显著价值。尽管前景广阔,RAG在化学领域的应用仍存在探索不足的问题,主要归因于缺乏高质量领域专用语料库和精心构建的评估基准。本研究提出ChemRAG-Bench——一个旨在系统评估RAG在多样化化学相关任务中有效性的综合基准。配套的化学语料库整合了多源异构知识,包括科学文献、PubChem数据库、PubMed摘要、教科书及维基百科条目。此外,我们开发了模块化可扩展的ChemRAG-Toolkit工具包,支持五种检索算法和八种LLM。通过该工具包验证,RAG较直接推理方法平均获得17.4%的相对性能提升。我们进一步对检索器架构、语料选择及检索段落数量进行深度分析,最终提出指导化学领域RAG系统未来研究与部署的实用建议。代码与数据详见https://chemrag.github.io。


OnPrem.LLM: A Privacy-Conscious Document Intelligence Toolkit

Abstract

arXiv:2505.07672v1 Announce Type: cross Abstract: We present OnPrem.LLM, a Python-based toolkit for applying large language models (LLMs) to sensitive, non-public data in offline or restricted environments. The system is designed for privacy-preserving use cases and provides prebuilt pipelines for document processing and storage, retrieval-augmented generation (RAG), information extraction, summarization, classification, and prompt/output processing with minimal configuration. OnPrem.LLM supports multiple LLM backends -- including llama.cpp, Ollama, vLLM, and Hugging Face Transformers -- with quantized model support, GPU acceleration, and seamless backend switching. Although designed for fully local execution, OnPrem.LLM also supports integration with a wide range of cloud LLM providers when permitted, enabling hybrid deployments that balance performance with data control. A no-code web interface extends accessibility to non-technical users.

摘要

我们推出OnPrem.LLM——一个基于Python的工具包,专为在离线或受限环境中处理敏感非公开数据的大型语言模型(LLM)应用而设计。该系统针对隐私保护场景开发,提供开箱即用的文档处理与存储、检索增强生成(RAG)、信息抽取、摘要生成、分类以及提示/输出处理等预构建流程,且仅需最小化配置。OnPrem.LLM支持多种LLM后端(包括llama.cpp、Ollama、vLLM和Hugging Face Transformers),具备量化模型支持、GPU加速和无缝后端切换能力。尽管设计初衷是实现完全本地化运行,该系统在获得许可时仍支持与各类云端LLM服务商集成,从而构建兼顾性能与数据控制的混合部署方案。通过无代码网页界面,该工具还能为非技术用户提供便捷访问途径。


Concept-Level Explainability for Auditing & Steering LLM Responses

Abstract

arXiv:2505.07610v1 Announce Type: cross Abstract: As large language models (LLMs) become widely deployed, concerns about their safety and alignment grow. An approach to steer LLM behavior, such as mitigating biases or defending against jailbreaks, is to identify which parts of a prompt influence specific aspects of the model's output. Token-level attribution methods offer a promising solution, but still struggle in text generation, explaining the presence of each token in the output separately, rather than the underlying semantics of the entire LLM response. We introduce ConceptX, a model-agnostic, concept-level explainability method that identifies the concepts, i.e., semantically rich tokens in the prompt, and assigns them importance based on the outputs' semantic similarity. Unlike current token-level methods, ConceptX also offers to preserve context integrity through in-place token replacements and supports flexible explanation goals, e.g., gender bias. ConceptX enables both auditing, by uncovering sources of bias, and steering, by modifying prompts to shift the sentiment or reduce the harmfulness of LLM responses, without requiring retraining. Across three LLMs, ConceptX outperforms token-level methods like TokenSHAP in both faithfulness and human alignment. Steering tasks boost sentiment shift by 0.252 versus 0.131 for random edits and lower attack success rates from 0.463 to 0.242, outperforming attribution and paraphrasing baselines. While prompt engineering and self-explaining methods sometimes yield safer responses, ConceptX offers a transparent and faithful alternative for improving LLM safety and alignment, demonstrating the practical value of attribution-based explainability in guiding LLM behavior.

摘要

随着大语言模型(LLMs)的广泛应用,其安全性与对齐性问题日益受到关注。引导LLM行为(如缓解偏见或防御越狱攻击)的一种方法是识别提示中影响模型输出特定方面的关键部分。虽然基于词元的归因方法提供了可行方案,但在文本生成任务中仍存在局限——这些方法仅能分别解释输出中每个词元的出现原因,而无法揭示整个LLM响应背后的语义逻辑。我们提出ConceptX,这是一种与模型无关的概念级可解释性方法,它能识别提示中的概念(即具有丰富语义的词元),并根据输出语义相似度为其分配重要性权重。与现有词元级方法不同,ConceptX通过原位词元替换保持上下文完整性,并支持灵活的解释目标(如性别偏见)。该方法既可实现审计功能(通过揭示偏见来源),又能进行行为引导(通过修改提示词来改变情感倾向或降低LLM响应危害性),且无需重新训练模型。在三种LLM上的实验表明,ConceptX在忠实度和人类对齐性方面均优于TokenSHAP等词元级方法。在引导任务中,相较于随机编辑的0.131情感偏移量,ConceptX达到0.252的提升幅度,并将攻击成功率从0.463降至0.242,其表现优于归因和改写基线方法。尽管提示工程和自解释方法有时能产生更安全的响应,但ConceptX为提升LLM安全性和对齐性提供了透明可靠的替代方案,这证明了基于归因的可解释性方法在引导LLM行为方面的实用价值。


SpecRouter: Adaptive Routing for Multi-Level Speculative Decoding in Large Language Models

Abstract

arXiv:2505.07680v1 Announce Type: cross Abstract: Large Language Models (LLMs) present a critical trade-off between inference quality and computational cost: larger models offer superior capabilities but incur significant latency, while smaller models are faster but less powerful. Existing serving strategies often employ fixed model scales or static two-stage speculative decoding, failing to dynamically adapt to the varying complexities of user requests or fluctuations in system performance. This paper introduces \systemname{}, a novel framework that reimagines LLM inference as an adaptive routing problem solved through multi-level speculative decoding. \systemname{} dynamically constructs and optimizes inference "paths" (chains of models) based on real-time feedback, addressing the limitations of static approaches. Our contributions are threefold: (1) An \textbf{adaptive model chain scheduling} mechanism that leverages performance profiling (execution times) and predictive similarity metrics (derived from token distribution divergence) to continuously select the optimal sequence of draft and verifier models, minimizing predicted latency per generated token. (2) A \textbf{multi-level collaborative verification} framework where intermediate models within the selected chain can validate speculative tokens, reducing the verification burden on the final, most powerful target model. (3) A \textbf{synchronized state management} system providing efficient, consistent KV cache handling across heterogeneous models in the chain, including precise, low-overhead rollbacks tailored for asynchronous batch processing inherent in multi-level speculation. Preliminary experiments demonstrate the validity of our method.

摘要

大语言模型(LLMs)在推理质量与计算成本之间存在关键权衡:较大模型具备更强能力但伴随显著延迟,较小模型响应更快但能力较弱。现有服务策略通常采用固定规模模型或静态两阶段推测解码,无法动态适应用户请求的复杂度变化及系统性能波动。本文提出\systemname{},该创新框架将LLM推理重构为通过多级推测解码解决的自适应路由问题。\systemname{}基于实时反馈动态构建并优化推理"路径"(模型链),以克服静态方法的局限性。我们的贡献包含三方面:(1)自适应模型链调度机制,利用性能分析(执行时间)和预测相似性度量(源自token分布散度)持续选择最优草稿模型与验证模型序列,最小化单token生成延迟;(2)多级协同验证框架,允许选定链中的中间模型验证推测token,减轻最终目标大模型的验证负担;(3)同步状态管理系统,为链中异构模型提供高效一致的KV缓存处理,包括针对多级推测固有异步批处理特性设计的精准低开销回滚机制。初步实验验证了方法的有效性。


Circuit Partitioning Using Large Language Models for Quantum Compilation and Simulations

Abstract

arXiv:2505.07711v1 Announce Type: cross Abstract: We are in the midst of the noisy intermediate-scale quantum (NISQ) era, where quantum computers are limited by noisy gates, some of which are more error-prone than others and can render the final computation incomprehensible. Quantum circuit compilation algorithms attempt to minimize these noisy gates when mapping quantum algorithms onto quantum hardware but face computational challenges that restrict their application to circuits with no more than 5-6 qubits, necessitating the need to partition large circuits before the application of noisy quantum gate minimization algorithms. The existing generation of these algorithms is heuristic in nature and does not account for downstream gate minimization tasks. Large language models (LLMs) have the potential to change this and help improve quantum circuit partitions. This paper investigates the use of LLMs, such as Llama and Mistral, for partitioning quantum circuits by capitalizing on their abilities to understand and generate code, including QASM. Specifically, we teach LLMs to partition circuits using the quick partition approach of the Berkeley Quantum Synthesis Toolkit. Through experimental evaluations, we show that careful fine-tuning of open source LLMs enables us to obtain an accuracy of 53.4% for the partition task while over-the-shelf LLMs are unable to correctly partition circuits, using standard 1-shot and few-shot training approaches.

摘要

我们正处于嘈杂中等规模量子(NISQ)时代,量子计算机受限于噪声门操作,其中某些门比其他门更易出错,可能导致最终计算结果无法解读。量子电路编译算法试图在将量子算法映射到量子硬件时最小化这些噪声门,但面临计算挑战,限制其仅能应用于不超过5-6个量子位的电路,因此需在实施噪声量子门最小化算法前对大型电路进行分割。现有这类算法本质上是启发式的,未考虑后续门最小化任务。大型语言模型(LLMs)有望改变这一现状并改进量子电路分割。本文研究利用Llama和Mistral等LLMs进行量子电路分割,充分发挥其理解和生成代码(包括QASM)的能力。具体而言,我们采用伯克利量子合成工具包的快速分割方法训练LLMs执行电路分割。实验评估表明,通过对开源LLMs的精细调优,分割任务准确率可达53.4%,而采用标准单样本和少样本训练方法的现成LLMs则无法正确分割电路。


Enhancing Code Generation via Bidirectional Comment-Level Mutual Grounding

Abstract

arXiv:2505.07768v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated unprecedented capability in code generation. However, LLM-generated code is still plagued with a wide range of functional errors, especially for complex programming tasks that LLMs have not seen before. Recent studies have shown that developers often struggle with inspecting and fixing incorrect code generated by LLMs, diminishing their productivity and trust in LLM-based code generation. Inspired by the mutual grounding theory in communication, we propose an interactive approach that leverages code comments as a medium for developers and LLMs to establish a shared understanding. Our approach facilitates iterative grounding by interleaving code generation, inline comment generation, and contextualized user feedback through editable comments to align generated code with developer intent. We evaluated our approach on two popular benchmarks and demonstrated that our approach significantly improved multiple state-of-the-art LLMs, e.g., 17.1% pass@1 improvement for code-davinci-002 on HumanEval. Furthermore, we conducted a user study with 12 participants in comparison to two baselines: (1) interacting with GitHub Copilot, and (2) interacting with a multi-step code generation paradigm called Multi-Turn Program Synthesis. Participants completed the given programming tasks 16.7% faster and with 10.5% improvement in task success rate when using our approach. Both results show that interactively refining code comments enables the collaborative establishment of mutual grounding, leading to more accurate code generation and higher developer confidence.

摘要

大型语言模型(LLMs)在代码生成领域展现出前所未有的能力。然而,LLM生成的代码仍存在广泛的功能性错误,尤其是针对模型未曾见过的复杂编程任务。近期研究表明,开发者往往难以检查和修正LLM生成的错误代码,这降低了其生产力并削弱了对基于LLM的代码生成的信任。受交流中的互为基础理论启发,我们提出一种交互式方法,利用代码注释作为开发者与LLM建立共同理解的媒介。该方法通过交替进行代码生成、内联注释生成以及通过可编辑注释实现的情境化用户反馈,促进迭代式基础建立,从而使生成代码与开发者意图保持一致。我们在两个主流基准测试上评估了该方法,结果表明其显著提升了多个最先进LLM的性能(例如code-davinci-002在HumanEval上的pass@1指标提升17.1%)。此外,我们开展了12人参与的对比用户研究,基线方法包括:(1)与GitHub Copilot交互;(2)与称为多轮程序合成的多步代码生成范式交互。使用本方法时,参与者完成任务速度提升16.7%,任务成功率提高10.5%。两项结果均表明,通过交互式优化代码注释能够协同建立互为基础,从而实现更精确的代码生成并提升开发者信心。


Learning Dynamics in Continual Pre-Training for Large Language Models

Abstract

arXiv:2505.07796v1 Announce Type: cross Abstract: Continual Pre-Training (CPT) has become a popular and effective method to apply strong foundation models to specific downstream tasks. In this work, we explore the learning dynamics throughout the CPT process for large language models. We specifically focus on how general and downstream domain performance evolves at each training step, with domain performance measured via validation losses. We have observed that the CPT loss curve fundamentally characterizes the transition from one curve to another hidden curve, and could be described by decoupling the effects of distribution shift and learning rate annealing. We derive a CPT scaling law that combines the two factors, enabling the prediction of loss at any (continual) training steps and across learning rate schedules (LRS) in CPT. Our formulation presents a comprehensive understanding of several critical factors in CPT, including loss potential, peak learning rate, training steps, replay ratio, etc. Moreover, our approach can be adapted to customize training hyper-parameters to different CPT goals such as balancing general and domain-specific performance. Extensive experiments demonstrate that our scaling law holds across various CPT datasets and training hyper-parameters.

摘要

持续预训练(CPT)已成为将强大基础模型应用于特定下游任务的一种流行且有效的方法。在本研究中,我们探索了大型语言模型在CPT过程中的学习动态。我们特别关注通用领域和下游领域性能在每一步训练中的演变情况,其中领域性能通过验证损失来衡量。我们观察到,CPT损失曲线本质上刻画了从一条曲线向另一条隐藏曲线过渡的特征,并且可以通过解耦分布偏移和学习率退火的影响来描述这一过程。我们推导出一个结合了这两个因素的CPT缩放定律,能够预测CPT中任意(持续)训练步骤以及跨学习率调度(LRS)的损失。我们的公式全面呈现了CPT中若干关键因素,包括损失潜力、峰值学习率、训练步数、回放比例等。此外,我们的方法可适用于根据不同CPT目标(如平衡通用性能与领域特定性能)定制训练超参数。大量实验表明,我们的缩放定律在多种CPT数据集和训练超参数下均成立。


Overflow Prevention Enhances Long-Context Recurrent LLMs

Abstract

arXiv:2505.07793v1 Announce Type: cross Abstract: A recent trend in LLMs is developing recurrent sub-quadratic models that improve long-context processing efficiency. We investigate leading large long-context models, focusing on how their fixed-size recurrent memory affects their performance. Our experiments reveal that, even when these models are trained for extended contexts, their use of long contexts remains underutilized. Specifically, we demonstrate that a chunk-based inference procedure, which identifies and processes only the most relevant portion of the input can mitigate recurrent memory failures and be effective for many long-context tasks: On LongBench, our method improves the overall performance of Falcon3-Mamba-Inst-7B by 14%, Falcon-Mamba-Inst-7B by 28%, RecurrentGemma-IT-9B by 50%, and RWKV6-Finch-7B by 51%. Surprisingly, this simple approach also leads to state-of-the-art results in the challenging LongBench v2 benchmark, showing competitive performance with equivalent size Transformers. Furthermore, our findings raise questions about whether recurrent models genuinely exploit long-range dependencies, as our single-chunk strategy delivers stronger performance - even in tasks that presumably require cross-context relations.

摘要

当前大语言模型(LLM)的发展趋势是构建具有次二次计算复杂度的循环架构模型,以提升长上下文处理效率。本研究针对主流长上下文大模型展开分析,重点探究其固定尺寸循环记忆机制对性能的影响。实验表明,即使这些模型经过长上下文训练,其对长上下文的利用仍不充分。具体而言,我们提出了一种基于分块的推理方法——通过识别并仅处理输入中最相关的片段,可有效缓解循环记忆失效问题:在LongBench基准测试中,该方法使Falcon3-Mamba-Inst-7B整体性能提升14%,Falcon-Mamba-Inst-7B提升28%,RecurrentGemma-IT-9B提升50%,RWKV6-Finch-7B提升51%。值得注意的是,这一简单策略在极具挑战性的LongBench v2基准测试中取得了最先进水平,与同等规模的Transformer模型性能相当。此外,我们的发现对循环模型是否真正利用了长距离依赖关系提出质疑——即使在需要跨上下文关联的任务中,单分块策略仍展现出更强的性能表现。


DSP: Dynamic Sequence Parallelism for Multi-Dimensional Transformers

Abstract

arXiv:2403.10266v5 Announce Type: replace Abstract: Scaling multi-dimensional transformers to long sequences is indispensable across various domains. However, the challenges of large memory requirements and slow speeds of such sequences necessitate sequence parallelism. All existing approaches fall under the category of embedded sequence parallelism, which are limited to shard along a single sequence dimension, thereby introducing significant communication overhead. However, the nature of multi-dimensional transformers involves independent calculations across multiple sequence dimensions. To this end, we propose Dynamic Sequence Parallelism (DSP) as a novel abstraction of sequence parallelism. DSP dynamically switches the parallel dimension among all sequences according to the computation stage with efficient resharding strategy. DSP offers significant reductions in communication costs, adaptability across modules, and ease of implementation with minimal constraints. Experimental evaluations demonstrate DSP's superiority over state-of-the-art embedded sequence parallelism methods by remarkable throughput improvements ranging from 32.2% to 10x, with less than 25% communication volume.

摘要

将多维Transformer模型扩展到长序列处理是多个领域的核心需求。然而,大内存占用和低速处理等挑战使得序列并行技术成为必要。现有方法均属于嵌入式序列并行范畴,仅支持单一序列维度的分片,导致显著的通信开销。但多维Transformer的本质特征在于跨多个序列维度的独立计算。为此,我们提出动态序列并行(DSP)作为序列并行技术的新范式。DSP通过高效的重分片策略,根据计算阶段动态切换所有序列的并行维度,具有通信成本大幅降低、模块适应性强、实现约束少等优势。实验评估表明,DSP相较最先进的嵌入式序列并行方法具有显著优势,吞吐量提升幅度达32.2%至10倍,同时通信量减少75%以上。


CHARTOM: A Visual Theory-of-Mind Benchmark for Multimodal Large Language Models

Abstract

arXiv:2408.14419v2 Announce Type: replace Abstract: We introduce CHARTOM, a visual theory-of-mind benchmark for multimodal large language models. CHARTOM consists of specially designed data visualizing charts. Given a chart, a language model needs to not only correctly comprehend the chart (the FACT question) but also judge if the chart will be misleading to a human reader (the MIND question). Both questions have significant societal benefits. We detail the construction of the CHARTOM benchmark including its calibration on human performance. We benchmark leading LLMs as of late 2024 - including GPT, Claude, Gemini, Qwen, Llama, and Llava - on the CHARTOM dataset and found that our benchmark was challenging to all of them, suggesting room for future large language models to improve.

摘要

我们推出CHARTOM——一个针对多模态大语言模型的视觉心理理论基准测试。该基准由专门设计的图表可视化数据构成,要求语言模型在给定图表时不仅能正确理解图表内容(事实性问题),还需判断该图表是否会对人类读者产生误导(心理问题)。这两个问题都具有重要的社会价值。我们详细阐述了CHARTOM基准的构建过程,包括基于人类表现的校准测试。通过对2024年末主流大语言模型(包括GPT、Claude、Gemini、Qwen、Llama和Llava)在CHARTOM数据集上的测试,发现所有模型在该基准上均表现欠佳,这表明未来大语言模型仍存在改进空间。


AIOS: LLM Agent Operating System

Abstract

arXiv:2403.16971v4 Announce Type: replace Abstract: LLM-based intelligent agents face significant deployment challenges, particularly related to resource management. Allowing unrestricted access to LLM or tool resources can lead to inefficient or even potentially harmful resource allocation and utilization for agents. Furthermore, the absence of proper scheduling and resource management mechanisms in current agent designs hinders concurrent processing and limits overall system efficiency. As the diversity and complexity of agents continue to grow, addressing these resource management issues becomes increasingly critical to LLM-based agent systems. To address these challenges, this paper proposes the architecture of AIOS (LLM-based AI Agent Operating System) under the context of managing LLM-based agents. It introduces a novel architecture for serving LLM-based agents by isolating resources and LLM-specific services from agent applications into an AIOS kernel. This AIOS kernel provides fundamental services (e.g., scheduling, context management, memory management, storage management, access control) and efficient management of resources (e.g., LLM and external tools) for runtime agents. To enhance usability, AIOS also includes an AIOS-Agent SDK, a comprehensive suite of APIs designed for utilizing functionalities provided by the AIOS kernel. Experimental results demonstrate that using AIOS can achieve up to 2.1x faster execution for serving agents built by various agent frameworks. The source code is available at https://github.com/agiresearch/AIOS.

摘要

基于大语言模型(LLM)的智能代理面临显著的部署挑战,尤其是资源管理相关问题。若允许代理无限制访问LLM或工具资源,可能导致低效甚至潜在有害的资源分配与利用。此外,当前代理设计中缺乏适当的调度与资源管理机制,阻碍了并发处理能力并限制整体系统效率。随着代理多样性与复杂性的持续增长,解决这些资源管理问题对基于LLM的代理系统愈发关键。为应对这些挑战,本文提出AIOS(基于LLM的AI代理操作系统)架构,通过将资源与LLM专属服务从代理应用中隔离至AIOS内核,构建了服务于LLM代理的新型架构。该AIOS内核为运行时代理提供基础服务(如调度、上下文管理、内存管理、存储管理、访问控制)及资源(如LLM与外部工具)的高效管理。为提升易用性,AIOS还包含AIOS-Agent SDK——一套完整API工具集,用于调用AIOS内核功能。实验结果表明,使用AIOS可使各类代理框架构建的代理执行速度最高提升2.1倍。源代码已发布于https://github.com/agiresearch/AIOS。


Exploring Gen-AI applications in building research and industry: A review

Abstract

arXiv:2410.01098v2 Announce Type: replace Abstract: This paper investigates the transformative potential of Generative AI (Gen-AI) technologies, particularly large language models, within the building industry. By leveraging these advanced AI tools, the study explores their application across key areas such as automated compliance checking and building design assistance. The research highlights how Gen-AI can automate labor-intensive processes, significantly improving efficiency and reducing costs in building practices. The paper first discusses the two widely applied fundamental models-Transformer and Diffusion model-and summarizes current pathways for accessing Gen-AI models and the most common techniques for customizing them. It then explores applications for text generation, such as compliance checking, control support, data mining, and building simulation input file editing. Additionally, it examines image generation, including direct generation through diffusion models and indirect generation through language model-supported template creation based on existing Computer-Aided Design or other design tools with rendering. The paper concludes with a comprehensive analysis of the current capabilities of Gen-AI in the building industry, outlining future directions for research and development, with the goal of paving the way for smarter, more effective, and responsive design, construction, and operational practices.

摘要

本文探讨了生成式人工智能(Gen-AI)技术(尤其是大语言模型)在建筑行业的变革潜力。通过运用这些先进AI工具,研究探索了其在自动化合规审查与建筑设计辅助等关键领域的应用。研究重点揭示了Gen-AI如何实现劳动密集型流程的自动化,从而显著提升建筑实践效率并降低成本。论文首先讨论了两类广泛应用的基础模型——Transformer与Diffusion模型,总结了当前访问Gen-AI模型的途径以及最常见的模型定制技术。随后研究了文本生成的应用场景,包括合规审查、控制支持、数据挖掘及建筑模拟输入文件编辑。此外,还考察了图像生成技术,涵盖通过扩散模型直接生成图像,以及基于现有计算机辅助设计或其他具备渲染功能的设计工具、通过语言模型支持的模板进行间接生成。最后,论文对Gen-AI在建筑行业的现有能力进行全面分析,并展望未来研发方向,旨在为更智能、高效、响应迅速的设计、施工及运营实践铺平道路。


Lean Copilot: Large Language Models as Copilots for Theorem Proving in Lean

Abstract

arXiv:2404.12534v3 Announce Type: replace Abstract: Neural theorem proving combines large language models (LLMs) with proof assistants such as Lean, where the correctness of formal proofs can be rigorously verified, leaving no room for hallucination. With existing neural theorem provers pretrained on a fixed collection of data and offering valuable suggestions at times, it is challenging for them to continually prove novel theorems in a fully autonomous mode, where human insights may be critical. In this paper, we explore LLMs as copilots that assist humans in proving theorems. We introduce Lean Copilot, a general framework for running LLM inference natively in Lean. It enables programmers to build various LLM-based proof automation tools that integrate seamlessly into the workflow of Lean users. Lean users can use our pretrained models or bring their own ones that run either locally (with or without GPUs) or on the cloud. Using Lean Copilot, we build LLM-based tools that suggest proof steps, complete proof goals, and select relevant premises. Experimental results on the Mathematics in Lean textbook demonstrate the effectiveness of our method compared to existing rule-based proof automation in Lean (aesop). When assisting humans, Lean Copilot requires only 2.08 manually-entered proof steps on average (3.86 required by aesop); when automating the theorem proving process, Lean Copilot automates 74.2% proof steps on average, 85% better than aesop (40.1%). We open source all code and artifacts under a permissive MIT license to facilitate further research.

摘要

神经定理证明将大型语言模型(LLMs)与Lean等证明助手相结合,可严格验证形式化证明的正确性,彻底消除幻觉空间。现有神经定理证明器虽基于固定数据集预训练并能提供有价值的建议,但难以在完全自主模式下持续证明新定理——其中人类洞察力可能至关重要。本文探索将LLMs作为人类证明定理的协作者,提出Lean Copilot:一个在Lean原生环境中运行LLM推理的通用框架。该框架支持开发者构建各类基于LLM的证明自动化工具,无缝集成至Lean用户工作流。用户可使用我们预训练的模型,或部署本地(含GPU/无GPU)及云端的自定义模型。基于Lean Copilot,我们开发了可推荐证明步骤、补全证明目标及筛选相关前提的LLM工具。在《Mathematics in Lean》教材上的实验表明:相较于Lean现有基于规则的自动化证明器(aesop),本方法具有显著优势。辅助人类时,Lean Copilot平均仅需2.08次手动输入证明步骤(aesop需3.86次);自动化证明过程中,其平均自动化率达74.2%,较aesop(40.1%)提升85%。我们以宽松的MIT许可证开源全部代码与构件,以促进后续研究。


A Statistical Case Against Empirical Human-AI Alignment

Abstract

arXiv:2502.14581v2 Announce Type: replace Abstract: Empirical human-AI alignment aims to make AI systems act in line with observed human behavior. While noble in its goals, we argue that empirical alignment can inadvertently introduce statistical biases that warrant caution. This position paper thus advocates against naive empirical alignment, offering prescriptive alignment and a posteriori empirical alignment as alternatives. We substantiate our principled argument by tangible examples like human-centric decoding of language models.

摘要

经验性人机对齐旨在使人工智能系统行为与观察到的人类行为保持一致。尽管目标崇高,但我们认为这种对齐方式可能无意中引入需警惕的统计偏差。本立场论文因此反对朴素的经验性对齐,并提出规范性对齐与后验经验性对齐作为替代方案。我们通过语言模型的人本解码等具体实例,为这一原则性论证提供了实证支撑。


Unbiased Evaluation of Large Language Models from a Causal Perspective

Abstract

arXiv:2502.06655v2 Announce Type: replace Abstract: Benchmark contamination has become a significant concern in the LLM evaluation community. Previous Agents-as-an-Evaluator address this issue by involving agents in the generation of questions. Despite their success, the biases in Agents-as-an-Evaluator methods remain largely unexplored. In this paper, we present a theoretical formulation of evaluation bias, providing valuable insights into designing unbiased evaluation protocols. Furthermore, we identify two type of bias in Agents-as-an-Evaluator through carefully designed probing tasks on a minimal Agents-as-an-Evaluator setup. To address these issues, we propose the Unbiased Evaluator, an evaluation protocol that delivers a more comprehensive, unbiased, and interpretable assessment of LLMs.Extensive experiments reveal significant room for improvement in current LLMs. Additionally, we demonstrate that the Unbiased Evaluator not only offers strong evidence of benchmark contamination but also provides interpretable evaluation results.

摘要

基准测试污染已成为大语言模型评估领域的重要问题。先前"智能体即评估者"方法通过让智能体参与问题生成来解决该问题。尽管取得成效,但这类方法中的偏差仍未被充分研究。本文提出评估偏差的理论框架,为设计无偏评估方案提供重要见解。通过在最小化"智能体即评估者"设置中精心设计探测任务,我们识别出两类偏差。针对这些问题,我们提出"无偏评估者"协议,该方案能对大语言模型进行更全面、无偏且可解释的评估。大量实验表明当前大语言模型仍有显著改进空间。此外,我们证明该评估方案不仅能提供基准测试污染的有力证据,还可产生可解释的评估结果。


Using Language Models to Decipher the Motivation Behind Human Behaviors

Abstract

arXiv:2503.15752v4 Announce Type: replace Abstract: AI presents a novel tool for deciphering the motivations behind human behaviors. By varying prompts to a large language model, we can elicit the full range of human behaviors in a variety of different scenarios in classic economic games. By analyzing which prompts elicit which behaviors, we infer (decipher) the motivations behind the human behaviors. We also show how one can analyze the prompts to reveal relationships between the classic economic games, providing insight into what different economic scenarios induce people to think about. We also show how this deciphering process can be used to understand differences in the behavioral tendencies of different populations. We show how AI offers a new way to examine the thinking and framing that produce different behaviors.

摘要

人工智能为解析人类行为动机提供了新工具。通过调整大型语言模型的提示指令,我们能够在经典经济博弈实验中激发各种情境下的人类全行为谱系。通过分析特定提示与对应行为的关联性,我们得以推断(解码)人类行为背后的动机机制。研究还展示了如何通过提示分析揭示经典经济博弈之间的内在联系,从而洞察不同经济情境如何引导人们的思维模式。此外,本文论证了该解码方法可用于理解不同群体行为倾向的差异性。研究表明人工智能为探究导致行为差异的思维过程和认知框架提供了全新研究路径。


Long Term Memory: The Foundation of AI Self-Evolution

Abstract

arXiv:2410.15665v4 Announce Type: replace Abstract: Large language models (LLMs) like GPTs, trained on vast datasets, have demonstrated impressive capabilities in language understanding, reasoning, and planning, achieving human-level performance in various tasks. Most studies focus on enhancing these models by training on ever-larger datasets to build more powerful foundation models. While training stronger models is important, enabling models to evolve during inference is equally crucial, a process we refer to as AI self-evolution. Unlike large-scale training, self-evolution may rely on limited data or interactions. Inspired by the columnar organization of the human cerebral cortex, we hypothesize that AI models could develop cognitive abilities and build internal representations through iterative interactions with their environment. To achieve this, models need long-term memory (LTM) to store and manage processed interaction data. LTM supports self-evolution by representing diverse experiences across environments and agents. In this report, we explore AI self-evolution and its potential to enhance models during inference. We examine LTM's role in lifelong learning, allowing models to evolve based on accumulated interactions. We outline the structure of LTM and the systems needed for effective data retention and representation. We also classify approaches for building personalized models with LTM data and show how these models achieve self-evolution through interaction. Using LTM, our multi-agent framework OMNE achieved first place on the GAIA benchmark, demonstrating LTM's potential for AI self-evolution. Finally, we present a roadmap for future research, emphasizing the importance of LTM for advancing AI technology and its practical applications.

摘要

如GPT等基于海量数据训练的大语言模型(LLM),在语言理解、推理与规划方面展现出令人瞩目的能力,已在多项任务中达到人类水平。当前研究多集中于通过更大规模的数据训练来构建更强的基础模型。然而在提升模型训练强度的同时,实现模型在推理过程中的自主进化同样至关重要,这一过程我们称之为AI自我进化。与大规模训练不同,自我进化可能仅依赖有限的数据或交互。受人类大脑皮层柱状结构的启发,我们提出假设:AI模型可通过与环境的迭代交互发展认知能力并构建内部表征。为实现这一目标,模型需要长期记忆(LTM)来存储和管理已处理的交互数据。LTM通过表征跨环境与智能体的多样化经验来支持自我进化。本报告探讨了AI自我进化及其在推理阶段增强模型的潜力,研究了LTM在终身学习中的作用——使模型能够基于累积的交互持续进化。我们阐述了LTM的结构架构与实现高效数据保留和表征所需的系统,分类了利用LTM数据构建个性化模型的方法,并展示这些模型如何通过交互实现自我进化。应用LTM技术的多智能体框架OMNE已在GAIA基准测试中取得首位,证实了LTM对AI自我进化的促进潜力。最后,我们提出了未来研究的路线图,强调LTM对推进AI技术发展及其实际应用的重要性。


Large Language Models Think Too Fast To Explore Effectively

Abstract

arXiv:2501.18009v2 Announce Type: replace Abstract: Large Language Models (LLMs) have emerged with many intellectual capacities. While numerous benchmarks assess their intelligence, limited attention has been given to their ability to explore--an essential capacity for discovering new information and adapting to novel environments in both natural and artificial systems. The extent to which LLMs can effectively explore, particularly in open-ended tasks, remains unclear. This study investigates whether LLMs can surpass humans in exploration during an open-ended task, using Little Alchemy 2 as a paradigm, where agents combine elements to discover new ones. Results show most LLMs underperform compared to humans, except for the o1 model, with traditional LLMs relying primarily on uncertainty-driven strategies, unlike humans who balance uncertainty and empowerment. Results indicate that traditional reasoning-focused LLMs, such as GPT-4o, exhibit a significantly faster and less detailed reasoning process, limiting their exploratory performance. In contrast, the DeepSeek reasoning model demonstrates prolonged, iterative thought processes marked by repetitive analysis of combinations and past trials, reflecting a more thorough and human-like exploration strategy. Representational analysis of the models with Sparse Autoencoders (SAE) revealed that uncertainty and choices are represented at earlier transformer blocks, while empowerment values are processed later, causing LLMs to think too fast and make premature decisions, hindering effective exploration. These findings shed light on the limitations of LLM exploration and suggest directions for improving their adaptability.

摘要

大语言模型(LLMs)已展现出多种智能能力。尽管已有众多基准测试评估其智力,但对其探索能力——这一自然与人工系统中发现新信息和适应新环境的核心能力——的关注却十分有限。目前尚不清楚LLMs在开放式任务中能否有效进行探索。本研究以《小小炼金术2》为范式,考察LLMs在开放式任务中的探索能力是否超越人类,该任务要求代理通过元素组合发现新元素。结果显示,除o1模型外,多数LLMs表现逊于人类:传统LLMs主要依赖不确定性驱动策略,而人类则能平衡不确定性与赋能。研究发现,以推理为核心的传统LLMs(如GPT-4o)表现出更快但更简略的推理过程,这限制了其探索性能;相比之下,DeepSeek推理模型展现出更持久、迭代的思维过程,其特点是对组合与既往尝试的重复分析,体现了更全面、类人的探索策略。通过稀疏自编码器(SAE)对模型表征的分析表明,不确定性和选择在Transformer早期区块即被表征,而赋能值则在后期处理,导致LLMs思考过快并做出过早决策,阻碍有效探索。这些发现揭示了LLMs探索能力的局限性,并为提升其适应性指明了方向。


Exploring Next Token Prediction For Optimizing Databases

Abstract

arXiv:2503.19619v3 Announce Type: replace Abstract: The Next Token Prediction paradigm (NTP, for short) lies at the forefront of modern large foundational models that are pre-trained on diverse and large datasets. These models generalize effectively, and have proven to be very successful in Natural Language Processing (NLP). Inspired by the generalization capabilities of Large Language Models (LLMs), we investigate whether the same NTP paradigm can be applied to DBMS design and optimization tasks. Adopting NTP directly for database optimization is non-trivial due to the fundamental differences between the domains. In this paper, we present a framework, termed Probe and Learn (PoLe), for applying NTP to optimize database systems. PoLe leverages Decision Transformers and hardware-generated tokens to effectively incorporate NTP into database systems. As a proof of concept, we demonstrate PoLe in the context of the index scheduling task over NUMA servers in main-memory database systems. Preliminary results for this scheduling task demonstrate that adopting NTP and PoLe can improve both performance and generalizability.

摘要

下一词元预测范式(简称NTP)是现代大型基础模型的核心技术,这些模型通过多样化的海量数据集进行预训练。此类模型展现出卓越的泛化能力,并在自然语言处理(NLP)领域取得显著成功。受大型语言模型(LLM)泛化能力的启发,本研究探讨NTP范式能否应用于数据库管理系统(DBMS)的设计与优化任务。由于领域间的本质差异,直接将NTP应用于数据库优化面临重大挑战。本文提出名为'探学'(Probe and Learn,PoLe)的框架,将NTP应用于数据库系统优化。PoLe通过决策变换器和硬件生成词元,有效实现了NTP与数据库系统的融合。作为概念验证,我们在主存数据库系统的NUMA服务器索引调度任务中展示了PoLe的可行性。该调度任务的初步结果表明,采用NTP与PoLe框架可同步提升系统性能与泛化能力。


Integrating Expert Knowledge into Logical Programs via LLMs

Abstract

arXiv:2502.12275v2 Announce Type: replace Abstract: This paper introduces ExKLoP, a novel framework designed to evaluate how effectively Large Language Models (LLMs) integrate expert knowledge into logical reasoning systems. This capability is especially valuable in engineering, where expert knowledge-such as manufacturer-recommended operational ranges-can be directly embedded into automated monitoring systems. By mirroring expert verification steps, tasks like range checking and constraint validation help ensure system safety and reliability. Our approach systematically evaluates LLM-generated logical rules, assessing both syntactic fluency and logical correctness in these critical validation tasks. We also explore the models' capacity for self-correction via an iterative feedback loop based on code execution outcomes. ExKLoP presents an extensible dataset comprising 130 engineering premises, 950 prompts, and corresponding validation points. It enables comprehensive benchmarking while allowing control over task complexity and scalability of experiments. We leverage the synthetic data creation methodology to conduct extensive empirical evaluation on a diverse set of LLMs including Llama3, Gemma3, Codestral and QwenCoder. The results reveal that most models generate nearly perfect syntactically correct code and exhibit strong performance in translating expert knowledge into correct code. At the same time, while most LLMs produce nearly flawless syntactic output, their ability to correctly implement logical rules varies, as does their capacity for self-improvement. Overall, ExKLoP serves as a robust evaluation platform that streamlines the selection of effective models for self-correcting systems while clearly delineating the types of errors encountered.

摘要

本文介绍了ExKLoP这一新颖框架,旨在评估大型语言模型(LLMs)将专家知识整合到逻辑推理系统中的有效性。该能力在工程领域尤为重要,例如制造商推荐的操作范围等专家知识可直接嵌入自动化监控系统。通过模拟专家验证步骤,范围检查和约束验证等任务有助于确保系统安全性和可靠性。我们的方法系统评估LLM生成的逻辑规则,在这些关键验证任务中同时考察语法流畅性和逻辑正确性。我们还通过基于代码执行结果的迭代反馈循环,探索了模型的自我修正能力。ExKLoP提供了一个可扩展数据集,包含130个工程前提、950条提示词及对应验证点,既能实现全面基准测试,又可控制任务复杂度和实验可扩展性。我们利用合成数据创建方法对包括Llama3、Gemma3、Codestral和QwenCoder在内的多种LLM进行了广泛实证评估。结果表明,大多数模型能生成语法近乎完美的代码,并在将专家知识转化为正确代码方面表现优异。然而,虽然多数LLM能产出近乎无瑕的语法输出,但其正确实现逻辑规则的能力及自我改进的潜力存在差异。总体而言,ExKLoP作为一个稳健的评估平台,既能简化自修正系统的高效模型选择流程,又能清晰界定所遇到的错误类型。


An Illusion of Progress? Assessing the Current State of Web Agents

Abstract

arXiv:2504.01382v2 Announce Type: replace Abstract: As digitalization and cloud technologies evolve, the web is becoming increasingly important in the modern society. Autonomous web agents based on large language models (LLMs) hold a great potential in work automation. It is therefore important to accurately measure and monitor the progression of their capabilities. In this work, we conduct a comprehensive and rigorous assessment of the current state of web agents. Our results depict a very different picture of the competency of current agents, suggesting over-optimism in previously reported results. This gap can be attributed to shortcomings in existing benchmarks. We introduce Online-Mind2Web, an online evaluation benchmark consisting of 300 diverse and realistic tasks spanning 136 websites. It enables us to evaluate web agents under a setting that approximates how real users use these agents. To facilitate more scalable evaluation and development, we also develop a novel LLM-as-a-Judge automatic evaluation method and show that it can achieve around 85% agreement with human judgment, substantially higher than existing methods. Finally, we present the first comprehensive comparative analysis of current web agents, highlighting both their strengths and limitations to inspire future research.

摘要

随着数字化与云计算技术的发展,网络在现代社会中的重要性日益凸显。基于大语言模型(LLMs)的自主网络代理器在工作自动化领域展现出巨大潜力,因此准确衡量并追踪其能力演进至关重要。本研究对当前网络代理器的发展现状进行了全面严谨的评估,结果表明现有代理器的实际能力与既往研究报道的乐观结论存在显著差异。这一差距可归因于现有基准测试的局限性。我们提出了Online-Mind2Web在线评估基准,该基准包含涵盖136个网站的300项多样化现实任务,使我们能够在近似真实用户使用场景下评估网络代理器。为促进更可扩展的评估与开发,我们还开发了新型LLM-as-a-Judge自动评估方法,实验证明其与人工评判结果的一致性可达85%,显著优于现有方法。最后,我们首次对当前主流网络代理器进行了全面对比分析,通过揭示其优势与局限为未来研究提供启示。


D-CIPHER: Dynamic Collaborative Intelligent Multi-Agent System with Planner and Heterogeneous Executors for Offensive Security

Abstract

arXiv:2502.10931v2 Announce Type: replace Abstract: Large Language Models (LLMs) have been used in cybersecurity such as autonomous security analysis or penetration testing. Capture the Flag (CTF) challenges serve as benchmarks to assess automated task-planning abilities of LLM agents for cybersecurity. Early attempts to apply LLMs for solving CTF challenges used single-agent systems, where feedback was restricted to a single reasoning-action loop. This approach was inadequate for complex CTF tasks. Inspired by real-world CTF competitions, where teams of experts collaborate, we introduce the D-CIPHER LLM multi-agent framework for collaborative CTF solving. D-CIPHER integrates agents with distinct roles with dynamic feedback loops to enhance reasoning on complex tasks. It introduces the Planner-Executor agent system, consisting of a Planner agent for overall problem-solving along with multiple heterogeneous Executor agents for individual tasks, facilitating efficient allocation of responsibilities among the agents. Additionally, D-CIPHER incorporates an Auto-prompter agent to improve problem-solving by auto-generating a highly relevant initial prompt. We evaluate D-CIPHER on multiple CTF benchmarks and LLM models via comprehensive studies to highlight the impact of our enhancements. Additionally, we manually map the CTFs in NYU CTF Bench to MITRE ATT&CK techniques that apply for a comprehensive evaluation of D-CIPHER's offensive security capability. D-CIPHER achieves state-of-the-art performance on three benchmarks: 22.0% on NYU CTF Bench, 22.5% on Cybench, and 44.0% on HackTheBox, which is 2.5% to 8.5% better than previous work. D-CIPHER solves 65% more ATT&CK techniques compared to previous work, demonstrating stronger offensive capability.

摘要

大型语言模型(LLMs)在网络安全领域(如自动化安全分析或渗透测试)已得到应用。夺旗赛(CTF)挑战可作为评估LLM代理在网络安全领域自动化任务规划能力的基准。早期尝试应用LLM解决CTF挑战采用单代理系统,其反馈仅限于单一推理-行动循环,该方法难以应对复杂CTF任务。受现实世界CTF竞赛中专家团队协作的启发,我们提出用于协作式CTF求解的D-CIPHER多智能体框架。该框架通过整合具有不同角色的代理与动态反馈循环来增强复杂任务推理能力,创新性地构建了规划者-执行者(Planner-Executor)代理系统:规划者代理负责整体问题求解,多个异构执行者代理处理具体任务,从而实现代理间的高效责任分配。此外,D-CIPHER引入自动提示生成代理,通过自动生成高相关性初始提示来提升问题解决能力。我们在多个CTF基准和LLM模型上对D-CIPHER进行全面评估以凸显改进效果,并手动将NYU CTF Bench中的挑战映射至MITRE ATT&CK技术框架以实现对D-CIPHER攻击性安全能力的综合评估。实验表明:D-CIPHER在NYU CTF Bench(22.0%)、Cybench(22.5%)和HackTheBox(44.0%)三个基准上达到最先进性能,较先前工作提升2.5%至8.5%;其解决的ATT&CK技术数量较先前工作增加65%,展现出更强的攻击能力。


A Survey of WebAgents: Towards Next-Generation AI Agents for Web Automation with Large Foundation Models

Abstract

arXiv:2503.23350v2 Announce Type: replace Abstract: With the advancement of web techniques, they have significantly revolutionized various aspects of people's lives. Despite the importance of the web, many tasks performed on it are repetitive and time-consuming, negatively impacting overall quality of life. To efficiently handle these tedious daily tasks, one of the most promising approaches is to advance autonomous agents based on Artificial Intelligence (AI) techniques, referred to as AI Agents, as they can operate continuously without fatigue or performance degradation. In the context of the web, leveraging AI Agents -- termed WebAgents -- to automatically assist people in handling tedious daily tasks can dramatically enhance productivity and efficiency. Recently, Large Foundation Models (LFMs) containing billions of parameters have exhibited human-like language understanding and reasoning capabilities, showing proficiency in performing various complex tasks. This naturally raises the question: `Can LFMs be utilized to develop powerful AI Agents that automatically handle web tasks, providing significant convenience to users?' To fully explore the potential of LFMs, extensive research has emerged on WebAgents designed to complete daily web tasks according to user instructions, significantly enhancing the convenience of daily human life. In this survey, we comprehensively review existing research studies on WebAgents across three key aspects: architectures, training, and trustworthiness. Additionally, several promising directions for future research are explored to provide deeper insights.

摘要

随着网络技术的进步,其已深刻改变了人们生活的诸多方面。尽管网络至关重要,但许多网络操作仍具有重复性和耗时性,对整体生活质量产生负面影响。为高效处理这些繁琐的日常任务,最具前景的解决方案之一是发展基于人工智能(AI)技术的自主代理——即AI智能体,因其可持续运作且不会出现疲劳或性能衰减。在网络环境下,利用这类被称为"网络智能体"(WebAgents)的AI代理来自动协助人类处理日常繁琐任务,可显著提升生产力和效率。近年来,具有数十亿参数的大型基础模型(LFMs)展现出类人的语言理解与推理能力,在完成各类复杂任务方面表现优异。这自然引出一个核心问题:"能否利用LFMs开发强大的AI智能体来自动处理网络任务,为用户提供显著便利?"为充分挖掘LFMs潜力,学界已涌现大量关于网络智能体的研究,这些智能体可根据用户指令完成日常网络任务,极大提升了人类日常生活的便利性。本综述从架构设计、训练方法和可信保障三个关键维度,系统回顾了现有网络智能体的研究成果,并探讨了未来研究的若干潜在方向,以期为该领域提供更深入的见解。


The Geometry of Self-Verification in a Task-Specific Reasoning Model

Abstract

arXiv:2504.14379v2 Announce Type: replace Abstract: How do reasoning models verify their own answers? We study this question by training a model using DeepSeek R1's recipe on the CountDown task. We leverage the fact that preference tuning leads to mode collapse, yielding a model that always produces highly structured chain-of-thought sequences. With this setup, we do top-down and bottom-up analyses to reverse-engineer how the model verifies its outputs. Top-down, we find Gated Linear Unit (GLU) weights encoding verification-related tokens, such as success'' or incorrect''. Bottom-up, we find that ``previous-token heads'' are mainly responsible for self-verification in our setup. Our analyses meet in the middle: drawing inspiration from inter-layer communication channels, we use the identified GLU weights to localize as few as three attention heads that can disable self-verification, pointing to a necessary component of a potentially larger verification circuit. Finally, we verify that similar verification components exist in our base model and a general reasoning DeepSeek-R1 model.

摘要

推理模型如何验证自身答案?本研究通过在CountDown任务上采用DeepSeek R1的训练方案,探讨该问题。我们利用偏好调优会导致模式坍缩的特性,获得一个始终生成高度结构化思维链序列的模型。基于此设置,我们通过自上而下与自下而上的分析逆向推导模型的输出验证机制。自上而下分析发现,门控线性单元(GLU)权重编码了"success"或"incorrect"等验证相关标记;自下而上分析则揭示"前向标记注意力头"在本设置中主导自我验证过程。两项分析结果在中层交汇:受层间通信通道启发,我们利用已识别的GLU权重定位出仅需三个注意力头即可禁用自我验证功能,这指向潜在更大规模验证回路中的必要组件。最后,我们证实基础模型及通用推理模型DeepSeek-R1中均存在类似的验证组件。


LLM-Guided Probabilistic Program Induction for POMDP Model Estimation

Abstract

arXiv:2505.02216v2 Announce Type: replace Abstract: Partially Observable Markov Decision Processes (POMDPs) model decision making under uncertainty. While there are many approaches to approximately solving POMDPs, we aim to address the problem of learning such models. In particular, we are interested in a subclass of POMDPs wherein the components of the model, including the observation function, reward function, transition function, and initial state distribution function, can be modeled as low-complexity probabilistic graphical models in the form of a short probabilistic program. Our strategy to learn these programs uses an LLM as a prior, generating candidate probabilistic programs that are then tested against the empirical distribution and adjusted through feedback. We experiment on a number of classical toy POMDP problems, simulated MiniGrid domains, and two real mobile-base robotics search domains involving partial observability. Our results show that using an LLM to guide in the construction of a low-complexity POMDP model can be more effective than tabular POMDP learning, behavior cloning, or direct LLM planning.

摘要

部分可观测马尔可夫决策过程(POMDPs)用于建模不确定性下的决策问题。尽管存在多种近似求解POMDPs的方法,本研究旨在解决此类模型的学习问题。我们特别关注一类POMDPs子集,其模型组件(包括观测函数、奖励函数、转移函数和初始状态分布函数)均可表示为短概率程序形式的低复杂度概率图模型。我们的学习策略采用大型语言模型(LLM)作为先验,生成候选概率程序后通过经验分布测试并基于反馈进行调整。实验涵盖经典玩具POMDP问题、模拟MiniGrid场景以及两个涉及部分可观测性的真实移动基站机器人搜索领域。结果表明,利用LLM指导低复杂度POMDP模型构建的方法,比表格化POMDP学习、行为克隆或直接LLM规划更具有效性。


Clickbait Detection via Large Language Models

Abstract

arXiv:2306.09597v4 Announce Type: replace-cross Abstract: Clickbait, which aims to induce users with some surprising and even thrilling headlines for increasing click-through rates, permeates almost all online content publishers, such as news portals and social media. Recently, Large Language Models (LLMs) have emerged as a powerful instrument and achieved tremendous success in a series of NLP downstream tasks. However, it is not yet known whether LLMs can be served as a high-quality clickbait detection system. In this paper, we analyze the performance of LLMs in the few-shot and zero-shot scenarios on several English and Chinese benchmark datasets. Experimental results show that LLMs cannot achieve the best results compared to the state-of-the-art deep and fine-tuning PLMs methods. Different from human intuition, the experiments demonstrated that LLMs cannot make satisfied clickbait detection just by the headlines.

摘要

点击诱饵旨在通过一些令人惊讶甚至耸动的标题来吸引用户点击,从而提高点击率,这种现象几乎渗透到所有在线内容发布平台,如新闻门户和社交媒体。近年来,大型语言模型(LLMs)作为一种强大工具崭露头角,在一系列自然语言处理下游任务中取得了巨大成功。然而,目前尚不清楚LLMs能否作为高质量的点击诱饵检测系统。本文分析了LLMs在少量样本和无样本情况下,在多个中英文基准数据集上的表现。实验结果表明,与最先进的深度学习和微调预训练语言模型方法相比,LLMs无法取得最佳效果。与人类直觉不同,实验证明LLMs仅凭标题无法实现令人满意的点击诱饵检测。


Towards Understanding Sycophancy in Language Models

Abstract

arXiv:2310.13548v4 Announce Type: replace-cross Abstract: Human feedback is commonly utilized to finetune AI assistants. But human feedback may also encourage model responses that match user beliefs over truthful ones, a behaviour known as sycophancy. We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback, and the potential role of human preference judgments in such behavior. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks. To understand if human preferences drive this broadly observed behavior, we analyze existing human preference data. We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.

摘要

人类反馈通常被用于微调AI助手。但人类反馈也可能鼓励模型生成符合用户信念而非真实事实的回应,这种行为被称为谄媚。我们研究了在微调过程中使用人类反馈的模型中谄媚行为的普遍性,以及人类偏好判断在此类行为中的潜在作用。首先,我们证明五种最先进的AI助手在四个不同的自由文本生成任务中持续表现出谄媚倾向。为探究人类偏好是否驱动这一广泛观察到的行为,我们分析了现有人类偏好数据。发现当回应符合用户观点时,其被偏好的概率更高。此外,无论是人类还是偏好模型(PMs),在不可忽视的情况下都会选择具有说服力的谄媚回应而非正确答案。针对偏好模型优化输出时,有时也会以牺牲真实性为代价换取谄媚性。总体而言,我们的结果表明谄媚是最先进AI助手的普遍行为,其部分原因可能源于人类偏好判断对谄媚回应的青睐。


Gemini: A Family of Highly Capable Multimodal Models

Abstract

arXiv:2312.11805v5 Announce Type: replace-cross Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.

摘要

本报告介绍了一个新型多模态模型家族Gemini,该系列在图像、音频、视频和文本理解方面展现出卓越能力。Gemini家族包含Ultra、Pro和Nano三种规格,适用于从复杂推理任务到设备端内存受限应用场景的广泛需求。在大量基准测试中的评估表明,性能最强的Gemini Ultra模型在32个基准测试中的30个实现了技术突破——尤为显著的是在深入研究过的MMLU考试基准上首次达到人类专家水平,并在我们考察的20个多模态基准测试中每一项都刷新了现有技术水平。我们相信Gemini家族在跨模态推理和语言理解方面的新能力将支持多样化的应用场景。本文还探讨了通过Gemini、Gemini Advanced、Google AI Studio和Cloud Vertex AI等服务对模型进行训练后优化及负责任部署的实施路径。


Fleet of Agents: Coordinated Problem Solving with Large Language Models

Abstract

arXiv:2405.06691v3 Announce Type: replace-cross Abstract: While numerous frameworks have been developed to enhance the reasoning abilities of large language models (LLMs), there is a scarcity of methods that effectively balance the trade-off between cost and quality. In this paper, we introduce Fleet of Agents (FoA), a novel and intuitive yet principled framework utilizing LLMs as agents to navigate through dynamic tree searches, employing a genetic-type particle filtering approach. FoA spawns a multitude of agents, each exploring the search space autonomously, followed by a selection phase where resampling based on a heuristic value function optimizes the balance between exploration and exploitation. This mechanism enables dynamic branching, adapting the exploration strategy based on discovered solutions. We conduct extensive experiments on three benchmark tasks, Game of 24'', Mini-Crosswords'', and WebShop'', utilizing four different LLMs, GPT-3.5'', GPT-4'', LLaMA3.2-11B'', and ``LLaMA3.2-90B''. On average across all tasks and LLMs, FoA obtains a quality improvement of ~5% while requiring only ~40% of the cost of previous SOTA methods. Notably, our analyses reveal that (1) FoA achieves the best cost-quality trade-off among all benchmarked methods and (2) FoA + LLaMA3.2-11B surpasses the Llama3.2-90B model. FoA is publicly available at https://github.com/au-clan/FoA.

摘要

尽管已有众多框架被开发用于增强大语言模型(LLMs)的推理能力,但能有效平衡成本与质量权衡的方法仍较为稀缺。本文提出智能体舰队(Fleet of Agents, FoA)——一种新颖直观且原理严谨的框架,该框架通过遗传型粒子滤波方法,将LLMs作为智能体在动态树搜索中进行导航。FoA会生成多个智能体,每个智能体自主探索搜索空间,随后通过基于启发式价值函数的重采样选择阶段来优化探索与利用的平衡。这种机制支持动态分支,能根据已发现解决方案调整探索策略。我们在三个基准任务("24点游戏"、"迷你填字游戏"和"WebShop")上使用四种不同LLMs("GPT-3.5"、"GPT-4"、"LLaMA3.2-11B"和"LLaMA3.2-90B")进行了广泛实验。在所有任务和LLMs中,FoA平均获得约5%的质量提升,同时仅需先前最先进方法约40%的成本。值得注意的是,我们的分析表明:(1)在所有基准方法中,FoA实现了最佳的成本-质量权衡;(2)FoA+LLaMA3.2-11B组合超越了Llama3.2-90B模型。FoA已在https://github.com/au-clan/FoA开源。


Integrating Large Language Models in Causal Discovery: A Statistical Causal Approach

Abstract

arXiv:2402.01454v5 Announce Type: replace-cross Abstract: In practical statistical causal discovery (SCD), embedding domain expert knowledge as constraints into the algorithm is important for reasonable causal models reflecting the broad knowledge of domain experts, despite the challenges in the systematic acquisition of background knowledge. To overcome these challenges, this paper proposes a novel method for causal inference, in which SCD and knowledge-based causal inference (KBCI) with a large language model (LLM) are synthesized through ``statistical causal prompting (SCP)'' for LLMs and prior knowledge augmentation for SCD. The experiments in this work have revealed that the results of LLM-KBCI and SCD augmented with LLM-KBCI approach the ground truths, more than the SCD result without prior knowledge. These experiments have also revealed that the SCD result can be further improved if the LLM undergoes SCP. Furthermore, with an unpublished real-world dataset, we have demonstrated that the background knowledge provided by the LLM can improve the SCD on this dataset, even if this dataset has never been included in the training data of the LLM. For future practical application of this proposed method across important domains such as healthcare, we also thoroughly discuss the limitations, risks of critical errors, expected improvement of techniques around LLMs, and realistic integration of expert checks of the results into this automatic process, with SCP simulations under various conditions both in successful and failure scenarios. The careful and appropriate application of the proposed approach in this work, with improvement and customization for each domain, can thus address challenges such as dataset biases and limitations, illustrating the potential of LLMs to improve data-driven causal inference across diverse scientific domains. The code used in this work is publicly available at: www.github.com/mas-takayama/LLM-and-SCD

摘要

在实际统计因果发现(SCD)中,尽管系统获取背景知识存在挑战,但将领域专家知识作为约束嵌入算法对于构建反映专家广泛认知的合理因果模型至关重要。为克服这些挑战,本文提出一种新型因果推理方法:通过"统计因果提示(SCP)"技术和大语言模型(LLM)的基于知识因果推理(KBCI),实现SCD与KBCI的协同融合,并利用先验知识增强SCD。实验表明,经LLM-KBCI增强的推理结果及SCD结果相较于无先验知识的SCD更接近真实因果;当LLM应用SCP时,SCD结果可得到进一步优化。此外,在一个未公开的真实数据集上,我们证实即使该数据未包含于LLM训练集,LLM提供的背景知识仍能提升SCD效果。针对医疗等重要领域的实际应用前景,我们深入探讨了该方法的局限性、关键错误风险、LLM相关技术的预期改进空间,以及如何将专家结果核查切实融入自动化流程,并通过多种成功与失败场景下的SCP模拟进行验证。本研究提出的方法经过针对性改进和领域定制后,可解决数据集偏差等挑战,其审慎应用展现了LLM在跨学科数据驱动因果推理中的革新潜力。本研究代码已开源:www.github.com/mas-takayama/LLM-and-SCD


Order Matters in Hallucination: Reasoning Order as Benchmark and Reflexive Prompting for Large-Language-Models

Abstract

arXiv:2408.05093v4 Announce Type: replace-cross Abstract: Large language models (LLMs) have generated significant attention since their inception, finding applications across various academic and industrial domains. However, these models often suffer from the "hallucination problem", where outputs, though grammatically and logically coherent, lack factual accuracy or are entirely fabricated. A particularly troubling issue discovered and widely discussed recently is the numerical comparison error where multiple LLMs incorrectly infer that "9.11>9.9". We discovered that the order in which LLMs generate answers and reasoning impacts their consistency. Specifically, results vary significantly when an LLM generates an answer first and then provides the reasoning versus generating the reasoning process first and then the conclusion. Inspired by this, we propose a new benchmark method for assessing LLM consistency: comparing responses generated through these two different approaches. This benchmark effectively identifies instances where LLMs fabricate answers and subsequently generate justifications. Furthermore, we introduce a novel and straightforward prompt strategy designed to mitigate this issue. Experimental results demonstrate that this strategy improves performance across various LLMs compared to direct questioning. This work not only sheds light on a critical flaw in LLMs but also offers a practical solution to enhance their reliability.

摘要

大型语言模型(LLMs)自问世以来已引发广泛关注,并在学术与工业领域得到多样化应用。然而,这些模型普遍存在"幻觉问题"——其输出虽具备语法与逻辑连贯性,却常缺乏事实准确性或完全虚构。近期发现并引发广泛讨论的数值比较错误尤为突出,多个LLMs错误推断出"9.11>9.9"。研究发现,LLMs生成答案与推理的顺序会影响其一致性:当模型首先生成答案再提供推理时,其结果与先进行推理再得出结论的情况存在显著差异。受此启发,我们提出评估LLM一致性的新基准方法:对比这两种不同路径生成的响应。该基准能有效识别LLMs先虚构答案后编造论证的情况。此外,我们设计了一种新颖且简洁的提示策略以缓解该问题。实验表明,相较于直接提问,该策略能提升多种LLMs的性能。本研究不仅揭示了LLMs的关键缺陷,更为提升其可靠性提供了实用解决方案。


MedualTime: A Dual-Adapter Language Model for Medical Time Series-Text Multimodal Learning

Abstract

arXiv:2406.06620v3 Announce Type: replace-cross Abstract: The recent rapid advancements in language models (LMs) have garnered attention in medical time series-text multimodal learning. However, existing contrastive learning-based and prompt-based LM approaches tend to be biased, often assigning a primary role to time series modality while treating text modality as secondary. We classify these approaches under a temporal-primary paradigm, which may overlook the unique and critical task-relevant information embedded in text modality like clinical reports, thus failing to fully leverage mutual benefits and complementarity of different modalities. To fill this gap, we propose a novel textual-temporal multimodal learning paradigm that enables either modality to serve as the primary while being enhanced by the other, thereby effectively capturing modality-specific information and fostering cross-modal interaction. In specific, we design MedualTime, a language model composed of dual adapters to implement temporal-primary and textual-primary modeling simultaneously. Within each adapter, lightweight adaptation tokens are injected into the top layers of LM to encourage high-level modality fusion. The shared LM pipeline by dual adapters not only achieves adapter alignment but also enables efficient fine-tuning, reducing computational resources. Empirically, MedualTime demonstrates superior performance on medical data, achieving notable improvements of 8% accuracy and 12% F1 in supervised settings. Furthermore, MedualTime's transferability is validated by few-shot label transfer experiments from coarse-grained to fine-grained medical data. https://github.com/start2020/MedualTime

摘要

近期语言模型(LM)的快速发展引发了医学时间序列-文本多模态学习领域的广泛关注。然而,现有基于对比学习和提示学习的语言模型方法往往存在偏差,通常将时间序列模态视为主导角色,而将文本模态置于次要地位。我们将这些方法归类为"时序主导范式",该范式可能忽略临床报告等文本模态中蕴含的独特且关键的任务相关信息,从而无法充分利用不同模态间的互惠性与互补性。为填补这一空白,我们提出了一种新颖的"文本-时序"多模态学习范式,使任一模态均可作为主导模态并得到另一模态的增强,从而有效捕获模态特异性信息并促进跨模态交互。具体而言,我们设计了MedualTime模型——由双重适配器构成的语言模型,可同步实现时序主导和文本主导建模。每个适配器内部,我们在语言模型顶层注入轻量级自适应标记以促进高层级模态融合。双重适配器共享的语言模型管道不仅实现了适配器对齐,还能通过高效微调减少计算资源消耗。实验表明,MedualTime在医学数据上表现出卓越性能,在监督学习场景中准确率和F1分数分别显著提升8%和12%。此外,通过从粗粒度到细粒度医学数据的少样本标签迁移实验,验证了MedualTime的迁移能力。项目地址:https://github.com/start2020/MedualTime


Mapping Biomedical Ontology Terms to IDs: Effect of Domain Prevalence on Prediction Accuracy

Abstract

arXiv:2409.13746v2 Announce Type: replace-cross Abstract: This study evaluates the ability of large language models (LLMs) to map biomedical ontology terms to their corresponding ontology IDs across the Human Phenotype Ontology (HPO), Gene Ontology (GO), and UniProtKB terminologies. Using counts of ontology IDs in the PubMed Central (PMC) dataset as a surrogate for their prevalence in the biomedical literature, we examined the relationship between ontology ID prevalence and mapping accuracy. Results indicate that ontology ID prevalence strongly predicts accurate mapping of HPO terms to HPO IDs, GO terms to GO IDs, and protein names to UniProtKB accession numbers. Higher prevalence of ontology IDs in the biomedical literature correlated with higher mapping accuracy. Predictive models based on receiver operating characteristic (ROC) curves confirmed this relationship. In contrast, this pattern did not apply to mapping protein names to Human Genome Organisation's (HUGO) gene symbols. GPT-4 achieved a high baseline performance (95%) in mapping protein names to HUGO gene symbols, with mapping accuracy unaffected by prevalence. We propose that the high prevalence of HUGO gene symbols in the literature has caused these symbols to become lexicalized, enabling GPT-4 to map protein names to HUGO gene symbols with high accuracy. These findings highlight the limitations of LLMs in mapping ontology terms to low-prevalence ontology IDs and underscore the importance of incorporating ontology ID prevalence into the training and evaluation of LLMs for biomedical applications.

摘要

本研究评估了大型语言模型(LLM)在人类表型本体(HPO)、基因本体(GO)和UniProtKB术语体系中将生物医学本体术语映射至对应本体ID的能力。通过统计PubMed Central(PMC)数据集中的本体ID出现频次作为生物医学文献普及度的替代指标,我们探究了本体ID普及度与映射准确率的关系。结果表明:本体ID普及度能显著预测HPO术语至HPO ID、GO术语至GO ID以及蛋白质名称至UniProtKB登录号的准确映射。生物医学文献中本体ID出现频次越高,其映射准确率也越高。基于受试者工作特征(ROC)曲线的预测模型验证了这一关联。然而,该模式不适用于将蛋白质名称映射至人类基因组组织(HUGO)基因符号的任务。GPT-4在此类映射中表现出高基准准确率(95%),且映射准确率不受普及度影响。我们认为文献中HUGO基因符号的高普及度使其词汇化,从而使GPT-4能实现高精度映射。这些发现揭示了LLM在映射低普及度本体ID时的局限性,并强调了将本体ID普及度纳入生物医学领域LLM训练与评估体系的重要性。


CodeV: Empowering LLMs with HDL Generation through Multi-Level Summarization

Abstract

arXiv:2407.10424v5 Announce Type: replace-cross Abstract: The design flow of processors, particularly in hardware description languages (HDL) like Verilog and Chisel, is complex and costly. While recent advances in large language models (LLMs) have significantly improved coding tasks in software languages such as Python, their application in HDL generation remains limited due to the scarcity of high-quality HDL data. Traditional methods of adapting LLMs for hardware design rely on synthetic HDL datasets, which often suffer from low quality because even advanced LLMs like GPT perform poorly in the HDL domain. Moreover, these methods focus solely on chat tasks and the Verilog language, limiting their application scenarios. In this paper, we observe that: (1) HDL code collected from the real world is of higher quality than code generated by LLMs. (2) LLMs like GPT-3.5 excel in summarizing HDL code rather than generating it. (3) An explicit language tag can help LLMs better adapt to the target language when there is insufficient data. Based on these observations, we propose an efficient LLM fine-tuning pipeline for HDL generation that integrates a multi-level summarization data synthesis process with a novel Chat-FIM-Tag supervised fine-tuning method. The pipeline enhances the generation of HDL code from natural language descriptions and enables the handling of various tasks such as chat and infilling incomplete code. Utilizing this pipeline, we introduce CodeV, a series of HDL generation LLMs. Among them, CodeV-All not only possesses a more diverse range of language abilities, i.e. Verilog and Chisel, and a broader scope of tasks, i.e. Chat and fill-in-middle (FIM), but it also achieves performance on VerilogEval that is comparable to or even surpasses that of CodeV-Verilog fine-tuned on Verilog only, making them the first series of open-source LLMs designed for multi-scenario HDL generation.

摘要

处理器设计流程,特别是在Verilog和Chisel等硬件描述语言(HDL)中,既复杂又成本高昂。尽管大型语言模型(LLM)的最新进展显著提升了Python等软件语言的编码任务表现,但由于高质量HDL数据的稀缺,其在HDL生成中的应用仍受限。传统方法通过合成HDL数据集来适配LLM进行硬件设计,但这些数据集往往质量低下,因为即使是GPT等先进LLM在HDL领域表现也不佳。此外,这些方法仅关注聊天任务和Verilog语言,限制了其应用场景。

本文发现:(1)从现实世界收集的HDL代码质量高于LLM生成的代码;(2)GPT-3.5等LLM更擅长总结HDL代码而非生成代码;(3)当数据不足时,显式的语言标签可帮助LLM更好地适应目标语言。基于这些观察,我们提出了一种高效的HDL生成LLM微调流程,该流程将多级摘要数据合成过程与新颖的Chat-FIM-Tag监督微调方法相结合。该流程增强了从自然语言描述生成HDL代码的能力,并支持处理聊天和填充不完整代码等多种任务。利用此流程,我们推出了CodeV系列HDL生成LLM。其中,CodeV-All不仅具备更丰富的语言能力(如Verilog和Chisel)和更广泛的任务范围(如聊天和中间填充),还在VerilogEval基准上取得了与仅针对Verilog微调的CodeV-Verilog相当甚至更优的性能,使其成为首个专为多场景HDL生成设计的开源LLM系列。


Systolic Arrays and Structured Pruning Co-design for Efficient Transformers in Edge Systems

Abstract

arXiv:2411.10285v2 Announce Type: replace-cross Abstract: Efficient deployment of resource-intensive transformers on edge devices necessitates cross-stack optimization. We thus study the interrelation between structured pruning and systolic acceleration, matching the size of pruned blocks with the systolic array dimensions. In this setting, computations of pruned weight blocks can be skipped, reducing run-time and energy consumption, but potentially impacting quality of service (QoS). To evaluate the trade-offs between systolic array size and sparsity opportunities, we present a novel co-design framework that integrates algorithmic optimization, system simulation, and hardware design. Targeting speech recognition and machine translation using transformers as case study, we analyze how configuration choices across the stack affect performance metrics. Results demonstrate that structured pruning on systems featuring systolic array acceleration can effectively increase performance, while maintaining high QoS levels. Up to 44% system-wide speedups due to structured pruning and quantization were measured, with only 1.4% word error rate degradation on the standard LibriSpeech dataset.

摘要

在边缘设备上高效部署资源密集型Transformer模型需要进行跨栈优化。为此,我们研究了结构化剪枝与脉动阵列加速之间的相互关系,将剪枝块的大小与脉动阵列维度相匹配。在此设置下,被剪枝权重块的计算可被跳过,从而降低运行时能耗,但可能影响服务质量(QoS)。为评估脉动阵列尺寸与稀疏化机会之间的权衡,我们提出了一种新型协同设计框架,集成了算法优化、系统模拟和硬件设计。以基于Transformer的语音识别和机器翻译为案例,我们分析了跨栈配置选择对性能指标的影响。结果表明,在配备脉动阵列加速的系统上实施结构化剪枝能有效提升性能,同时保持较高的QoS水平。实验测得结构化剪枝与量化技术最高可实现44%的全系统加速,在标准LibriSpeech数据集上仅导致1.4%的词错误率上升。


XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models

Abstract

arXiv:2411.15100v3 Announce Type: replace-cross Abstract: The applications of LLM Agents are becoming increasingly complex and diverse, leading to a high demand for structured outputs that can be parsed into code, structured function calls, and embodied agent commands. These developments bring significant demands for structured generation in LLM inference. Context-free grammar is a flexible approach to enable structured generation via constrained decoding. However, executing context-free grammar requires going through several stack states over all tokens in vocabulary during runtime, bringing non-negligible overhead for structured generation. In this paper, we propose XGrammar, a flexible and efficient structure generation engine for large language models. XGrammar accelerates context-free grammar execution by dividing the vocabulary into context-independent tokens that can be prechecked and context-dependent tokens that need to be interpreted during runtime. We further build transformations to expand the grammar context and reduce the number of context-independent tokens. Additionally, we build an efficient persistent stack to accelerate the context-dependent token checks. Finally, we co-design the grammar engine with LLM inference engine to overlap grammar computation with GPU executions. Evaluation results show that XGrammar can achieve up to 100x speedup over existing solutions. Combined with an LLM inference engine, it can generate near-zero overhead structure generation in end-to-end low-LLM serving.

摘要

大型语言模型智能体的应用正变得日益复杂多样,导致对可解析为代码、结构化函数调用及具身智能体命令的结构化输出需求激增。这一发展趋势对语言模型推理中的结构化生成提出了重大需求。上下文无关文法是通过约束解码实现结构化生成的灵活方法,但其执行过程需要在运行时遍历词汇表中所有标记的多个栈状态,为结构化生成带来不可忽视的开销。本文提出XGrammar——一个面向大语言模型的灵活高效结构化生成引擎。该引擎通过将词汇表划分为可预检的上下文无关标记和需运行时解释的上下文相关标记,加速了上下文无关文法的执行。我们进一步构建语法上下文扩展转换以减少上下文无关标记数量,并设计高效持久化栈结构来加速上下文相关标记检查。此外,我们通过将文法引擎与语言模型推理引擎协同设计,实现了文法计算与GPU执行的并行化。评估结果表明,XGrammar相较现有方案最高可实现100倍加速,在与语言模型推理引擎结合时,能在端到端低延迟服务中实现近乎零开销的结构化生成。


Evaluating Creative Short Story Generation in Humans and Large Language Models

Abstract

arXiv:2411.02316v5 Announce Type: replace-cross Abstract: Story-writing is a fundamental aspect of human imagination, relying heavily on creativity to produce narratives that are novel, effective, and surprising. While large language models (LLMs) have demonstrated the ability to generate high-quality stories, their creative story-writing capabilities remain under-explored. In this work, we conduct a systematic analysis of creativity in short story generation across 60 LLMs and 60 people using a five-sentence cue-word-based creative story-writing task. We use measures to automatically evaluate model- and human-generated stories across several dimensions of creativity, including novelty, surprise, diversity, and linguistic complexity. We also collect creativity ratings and Turing Test classifications from non-expert and expert human raters and LLMs. Automated metrics show that LLMs generate stylistically complex stories, but tend to fall short in terms of novelty, surprise and diversity when compared to average human writers. Expert ratings generally coincide with automated metrics. However, LLMs and non-experts rate LLM stories to be more creative than human-generated stories. We discuss why and how these differences in ratings occur, and their implications for both human and artificial creativity.

摘要

故事创作是人类想象力的核心体现,其高度依赖创造力来产生新颖、有效且出人意料的叙事。尽管大型语言模型(LLMs)已展现出生成高质量故事的能力,但其创造性写作潜力仍未得到充分探索。本研究通过基于五句子提示词的创意故事写作任务,对60个LLMs和60名人类参与者的短篇故事生成创造力进行了系统分析。我们采用自动化指标从新颖性、意外性、多样性和语言复杂性等多个维度评估模型与人类生成的故事,同时收集了非专业/专业人类评审员及LLMs对创造力的评分和图灵测试分类结果。自动化指标显示,LLMs能生成语言风格复杂的故事,但在新颖性、意外性和多样性方面普遍低于人类作者的平均水平。专家评分与自动化指标基本一致,然而LLMs和非专业评审员倾向于认为LLM生成的故事比人类作品更具创造性。我们探讨了这些评分差异的成因及其对人类与人工智能创造力的启示。


Multi-Party Supervised Fine-tuning of Language Models for Multi-Party Dialogue Generation

Abstract

arXiv:2412.05342v3 Announce Type: replace-cross Abstract: Large Language Models (LLM) are usually fine-tuned to participate in dyadic or two-party dialogues, which can not adapt well to multi-party dialogues (MPD), which hinders their applications in such scenarios including multi-personal meetings, discussions and daily communication. Previous LLM-based researches mainly focus on the multi-agent framework, while their base LLMs are still pairwisely fine-tuned. In this work, we design a multi-party fine-tuning framework (MuPaS) for LLMs on the multi-party dialogue datasets, and prove such a straightforward framework can let the LLM align with the multi-party conversation style efficiently and effectively. We also design two training strategies which can convert MuPaS into the MPD simulator. Substantial experiments show that MuPaS can achieve state-of-the-art multi-party response, higher accuracy of the-next-speaker prediction, higher human and automatic evaluated utterance qualities, and can even generate reasonably with out-of-distribution scene, topic and role descriptions. The MuPaS framework bridges the LLM training with more complicated multi-party applications, such as conversation generation, virtual rehearsal or meta-universe.

摘要

大型语言模型(LLM)通常经过微调以参与二元或双方对话,难以适应多方对话(MPD)场景,这限制了其在多人会议、讨论和日常交流等场景中的应用。先前基于LLM的研究主要集中于多智能体框架,但其基础LLM仍采用成对微调方式。本研究设计了一个面向多方对话数据集的多方微调框架(MuPaS),证明该简洁框架能高效且有效地使LLM与多方对话风格对齐。我们还设计了两种训练策略,可将MuPaS转化为MPD模拟器。大量实验表明:MuPaS能实现最先进的多方响应效果,具有更高的下一位说话者预测准确率、更优的人类与自动评估话语质量,甚至能在分布外的场景、话题和角色描述下生成合理内容。该框架将LLM训练与更复杂的多方应用(如对话生成、虚拟排练或元宇宙)相衔接。


Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs

Abstract

arXiv:2412.01818v2 Announce Type: replace-cross Abstract: Large vision-language models (LVLMs) generally contain significantly more visual tokens than their textual counterparts, resulting in a considerable computational burden. Recent efforts have been made to tackle this issue by pruning visual tokens early within the language model. Most existing works use attention scores between text and visual tokens to assess the importance of visual tokens. However, in this study, we first analyze the text-visual attention in the language model and find that this score is not an ideal indicator for token pruning. Based on the analysis, We propose VisPruner, a plug-and-play method that utilizes visual cues for more effective token pruning in LVLMs. Specifically, we first use visual attention to select a limited number of significant tokens. Then, we remove duplicate tokens from the remaining ones based on their similarity. By retaining diverse tokens alongside the initially selected important tokens, we maximally preserve the visual information of the input image. Experimental results demonstrate that our VisPruner sustains strong performance across various VLM architectures and reduction ratios, significantly outperforming existing methods based on text-visual attention. Notably, without any training, VisPruner can reduce the FLOPs of LLaVA-1.5-7B by 91% and inference latency by 75%, while maintaining comparable performance. Our code is available at https://github.com/Theia-4869/VisPruner.

摘要

大型视觉语言模型(LVLM)通常包含远多于文本标记的视觉标记,导致计算负担显著增加。近期研究尝试通过在语言模型早期阶段剪枝视觉标记来解决这一问题。现有工作大多利用文本与视觉标记间的注意力分数来评估视觉标记重要性。然而,本研究首先分析了语言模型中的文本-视觉注意力机制,发现该分数并非理想的标记剪枝指标。基于此分析,我们提出VisPruner——一种即插即用方法,通过利用视觉线索实现更高效的LVLM标记剪枝。具体而言,我们首先使用视觉注意力筛选少量关键标记,随后基于相似度从剩余标记中去除重复项。通过保留多样化标记与初始筛选的重要标记,我们最大程度地保持了输入图像的视觉信息。实验结果表明,VisPruner能在不同VLM架构和缩减比率下保持强劲性能,显著优于基于文本-视觉注意力的现有方法。值得注意的是,在无需任何训练的情况下,VisPruner可将LLaVA-1.5-7B的FLOPs降低91%,推理延迟减少75%,同时保持相当性能。代码发布于https://github.com/Theia-4869/VisPruner。


Efficient and Comprehensive Feature Extraction in Large Vision-Language Model for Clinical Pathology Analysis

Abstract

arXiv:2412.09521v2 Announce Type: replace-cross Abstract: Pathological diagnosis is vital for determining disease characteristics, guiding treatment, and assessing prognosis, relying heavily on detailed, multi-scale analysis of high-resolution whole slide images (WSI). However, traditional pure vision models face challenges of redundant feature extraction, whereas existing large vision-language models (LVLMs) are limited by input resolution constraints, hindering their efficiency and accuracy. To overcome these issues, we propose two innovative strategies: the mixed task-guided feature enhancement, which directs feature extraction toward lesion-related details across scales, and the prompt-guided detail feature completion, which integrates coarse- and fine-grained features from WSI based on specific prompts without compromising inference speed. Leveraging a comprehensive dataset of 490,000 samples from diverse pathology tasks-including cancer detection, grading, vascular and neural invasion identification, and so on-we trained the pathology-specialized LVLM, OmniPath. Extensive experiments demonstrate that this model significantly outperforms existing methods in diagnostic accuracy and efficiency, offering an interactive, clinically aligned approach for auxiliary diagnosis in a wide range of pathology applications.

摘要

病理诊断对于确定疾病特征、指导治疗和评估预后至关重要,其高度依赖于对高分辨率全切片图像(WSI)进行详尽的多尺度分析。然而,传统纯视觉模型面临特征提取冗余的挑战,而现有的大型视觉语言模型(LVLM)受限于输入分辨率约束,导致效率和准确性受限。为解决这些问题,我们提出两项创新策略:混合任务引导的特征增强技术,通过多尺度定向提取病灶相关特征;以及提示引导的细节特征补全技术,在不影响推理速度的前提下,基于特定提示整合WSI的粗粒度与细粒度特征。基于涵盖癌症检测、分级、血管与神经浸润识别等多样化病理任务的49万样本数据集,我们训练了病理专用LVLM模型OmniPath。大量实验表明,该模型在诊断准确性和效率上显著优于现有方法,为广泛病理应用提供了一种交互式、临床导向的辅助诊断方案。


Moral Alignment for LLM Agents

Abstract

arXiv:2410.01639v4 Announce Type: replace-cross Abstract: Decision-making agents based on pre-trained Large Language Models (LLMs) are increasingly being deployed across various domains of human activity. While their applications are currently rather specialized, several research efforts are underway to develop more generalist agents. As LLM-based systems become more agentic, their influence on human activity will grow and their transparency will decrease. Consequently, developing effective methods for aligning them to human values is vital. The prevailing practice in alignment often relies on human preference data (e.g., in RLHF or DPO), in which values are implicit, opaque and are essentially deduced from relative preferences over different model outputs. In this work, instead of relying on human feedback, we introduce the design of reward functions that explicitly and transparently encode core human values for Reinforcement Learning-based fine-tuning of foundation agent models. Specifically, we use intrinsic rewards for the moral alignment of LLM agents. We evaluate our approach using the traditional philosophical frameworks of Deontological Ethics and Utilitarianism, quantifying moral rewards for agents in terms of actions and consequences on the Iterated Prisoner's Dilemma (IPD) environment. We also show how moral fine-tuning can be deployed to enable an agent to unlearn a previously developed selfish strategy. Finally, we find that certain moral strategies learned on the IPD game generalize to several other matrix game environments. In summary, we demonstrate that fine-tuning with intrinsic rewards is a promising general solution for aligning LLM agents to human values, and it might represent a more transparent and cost-effective alternative to currently predominant alignment techniques.

摘要

基于预训练大语言模型(LLM)的决策代理正日益广泛应用于人类活动的各个领域。尽管当前应用仍较为专业化,但多项研究正致力于开发更具通用性的代理系统。随着基于LLM的系统自主性增强,其对人类活动的影响将不断扩大,而其透明度则会相应降低。因此,开发有效的方法使其与人类价值观对齐至关重要。

现行的对齐方法通常依赖于人类偏好数据(如RLHF或DPO),其中价值观是隐式、不透明的,本质上通过不同模型输出的相对偏好来推导。本研究摒弃人类反馈机制,提出一种奖励函数设计方法,该方法明确且透明地编码核心人类价值观,用于基于强化学习的基座代理模型微调。具体而言,我们采用内在奖励机制来实现LLM代理的道德对齐。

我们使用传统哲学框架——义务论伦理学与功利主义进行评估,在迭代囚徒困境(IPD)环境中,通过代理行为及其后果量化道德奖励。实验还展示了道德微调如何使代理遗忘先前习得的自私策略。最后发现,在IPD游戏中习得的某些道德策略可泛化至其他矩阵游戏环境。研究表明,采用内在奖励的微调方法是实现LLM代理与人类价值观对齐的通用解决方案,相比当前主流对齐技术,可能提供更透明且更具成本效益的替代路径。


Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models

Abstract

arXiv:2501.13428v3 Announce Type: replace-cross Abstract: Large language models have achieved remarkable success in recent years, primarily due to the implementation of self-attention mechanisms. However, traditional Softmax attention suffers from numerical instability and reduced performance as the length of inference tokens increases. This paper addresses these issues by decomposing the Softmax operation into a non-linear transformation and the l1l_1-norm. We identify the latter as essential for maintaining model performance. By replacing the non-linear transformation with the Softplus activation function and introducing a dynamic scale factor for different token lengths based on invariance entropy, we create a novel attention mechanism with performance better than conventional Softmax attention across various inference lengths. To further improve the length extrapolation ability of the proposed attention mechanism, we introduce a novel re-weighting mechanism that amplifies significant attention weights while diminishing weaker ones, enabling the model to concentrate more effectively on relevant tokens. When combined with our proposed attention mechanism, this approach maintains nearly constant validation loss even at 16×\times the training token length, ensures numerical stability, and achieves superior results on downstream benchmarks.

摘要

大型语言模型近年来取得了显著成功,这主要归功于自注意力机制的应用。然而,传统的Softmax注意力在推理标记长度增加时存在数值不稳定性和性能下降的问题。本文通过将Softmax操作分解为非线性变换和l1l_1范数来解决这些问题,并发现后者对维持模型性能至关重要。通过用Softplus激活函数替代非线性变换,并基于不变性熵为不同标记长度引入动态比例因子,我们提出了一种新型注意力机制,其在不同推理长度下的性能均优于传统Softmax注意力。为进一步提升该注意力机制的长度外推能力,我们设计了一种创新的重加权机制,该机制能放大重要注意力权重并削弱较弱权重,使模型能更有效地聚焦相关标记。当与我们提出的注意力机制结合使用时,该方法即使在16倍于训练标记长度时仍能保持几乎恒定的验证损失,确保数值稳定性,并在下游基准测试中取得更优结果。


Advancing Single and Multi-task Text Classification through Large Language Model Fine-tuning

Abstract

arXiv:2412.08587v2 Announce Type: replace-cross Abstract: Both encoder-only models (e.g., BERT, RoBERTa) and large language models (LLMs, e.g., Llama3) have been widely used for text classification tasks. However, there is a lack of systematic studies comparing the performance of encoder-based models and LLMs in text classification, particularly when fine-tuning is involved. This study employed a diverse range of models and methods, varying in size and architecture, and including both fine-tuned and pre-trained approaches. We first assessed the performances of these LLMs on the 20 Newsgroups (20NG) and MASSIVE datasets, comparing them to encoder-only RoBERTa models. Additionally, we explored the multi-task capabilities of both model types by combining multiple classification tasks, including intent detection and slot-filling, into a single model using data from both datasets. Our results indicate that fully fine-tuned Llama3-70B models outperform RoBERTa-large and other decoder LLMs across various classification tasks and datasets. Moreover, the consolidated multi-task fine-tuned LLMs matched the performance of dual-model setups in both tasks across both datasets. Overall, our study provides a comprehensive benchmark of encoder-only and LLM models on text classification tasks and demonstrates a method to combine two or more fully fine-tuned decoder LLMs for reduced latency and equivalent performance.

摘要

编码器专用模型(如BERT、RoBERTa)和大语言模型(LLM,如Llama3)已广泛应用于文本分类任务。然而,目前缺乏系统研究比较基于编码器的模型与LLM在文本分类中的性能差异,尤其是在涉及微调的情况下。本研究采用了多种不同规模、架构的模型与方法,包括微调和预训练两种策略。我们首先评估了这些LLM在20 Newsgroups(20NG)和MASSIVE数据集上的表现,并与专用编码器RoBERTa模型进行对比。此外,通过整合两个数据集的数据,我们将意图检测和槽填充等多重分类任务合并至单一模型,探索了两类模型的多任务处理能力。实验结果表明,经过完整微调的Llama3-70B模型在各种分类任务和数据集上均优于RoBERTa-large及其他解码器LLM。更重要的是,经多任务联合微调的整合LLM在两项任务、两个数据集上均达到了双模型配置的性能水平。总体而言,本研究为文本分类任务中的编码器专用模型与LLM提供了全面基准测试,并展示了一种通过整合两个及以上完整微调的解码器LLM来保持等效性能同时降低延迟的方法。


Multilingual Performance of a Multimodal Artificial Intelligence System on Multisubject Physics Concept Inventories

Abstract

arXiv:2501.06143v3 Announce Type: replace-cross Abstract: We investigate the multilingual and multimodal performance of a large language model-based artificial intelligence (AI) system, GPT-4o, using a diverse set of physics concept inventories spanning multiple languages and subject categories. The inventories, sourced from the PhysPort website, cover classical physics topics such as mechanics, electromagnetism, optics, and thermodynamics, as well as relativity, quantum mechanics, astronomy, mathematics, and laboratory skills. Unlike previous text-only studies, we uploaded the inventories as images to reflect what a student would see on paper, thereby assessing the system's multimodal functionality. Our results indicate variation in performance across subjects, with laboratory skills standing out as the weakest. We also observe differences across languages, with English and European languages showing the strongest performance. Notably, the relative difficulty of an inventory item is largely independent of the language of the survey. When comparing AI results to existing literature on student performance, we find that the AI system outperforms average post-instruction undergraduate students in all subject categories except laboratory skills. Furthermore, the AI performs worse on items requiring visual interpretation of images than on those that are purely text-based. While our exploratory findings show GPT-4o's potential usefulness in physics education, they highlight the critical need for instructors to foster students' ability to critically evaluate AI outputs, adapt curricula thoughtfully in response to AI advancements, and address equity concerns associated with AI integration.

摘要

我们基于大型语言模型的人工智能(AI)系统GPT-4o,通过跨语言、跨学科领域的多样化物理概念量表,对其多语言与多模态性能进行了研究。这些量表源自PhysPort网站,涵盖经典物理主题(如力学、电磁学、光学和热力学)以及相对论、量子力学、天文学、数学和实验技能。与以往纯文本研究不同,我们将量表以图像形式输入以模拟学生纸质试卷场景,从而评估系统的多模态功能。结果显示:不同学科表现存在差异,其中实验技能表现最弱;不同语言间也存在差距,英语和欧洲语言表现最优。值得注意的是,量表项目的相对难度与调查语言基本无关。将AI结果与现有学生表现文献对比发现,除实验技能外,该AI系统在所有学科类别均优于本科生的平均课后表现。此外,AI在需要图像视觉解析的项目上表现逊于纯文本项目。尽管探索性研究表明GPT-4o在物理教育中具有潜在应用价值,但结果同时强调:教师亟需培养学生批判性评估AI输出的能力,针对AI发展审慎调整课程设计,并重视AI整合过程中可能引发的公平性问题。


Beyond Partisan Leaning: A Comparative Analysis of Political Bias in Large Language Models

Abstract

arXiv:2412.16746v4 Announce Type: replace-cross Abstract: As large language models (LLMs) become increasingly embedded in civic, educational, and political information environments, concerns about their potential political bias have grown. Prior research often evaluates such bias through simulated personas or predefined ideological typologies, which may introduce artificial framing effects or overlook how models behave in general use scenarios. This study adopts a persona-free, topic-specific approach to evaluate political behavior in LLMs, reflecting how users typically interact with these systems-without ideological role-play or conditioning. We introduce a two-dimensional framework: one axis captures partisan orientation on highly polarized topics (e.g., abortion, immigration), and the other assesses sociopolitical engagement on less polarized issues (e.g., climate change, foreign policy). Using survey-style prompts drawn from the ANES and Pew Research Center, we analyze responses from 43 LLMs developed in the U.S., Europe, China, and the Middle East. We propose an entropy-weighted bias score to quantify both the direction and consistency of partisan alignment, and identify four behavioral clusters through engagement profiles. Findings show most models lean center-left or left ideologically and vary in their nonpartisan engagement patterns. Model scale and openness are not strong predictors of behavior, suggesting that alignment strategy and institutional context play a more decisive role in shaping political expression.

摘要

随着大型语言模型(LLMs)在公共、教育和政治信息环境中的广泛应用,其潜在政治偏见问题日益引发关注。既往研究多通过模拟人物角色或预定义意识形态类型进行评估,这种方法可能引入人为框架效应或忽视模型在常规使用场景中的表现。本研究采用无角色设定、主题聚焦的方法评估LLMs的政治行为,以反映用户与系统交互的真实模式——即不预设意识形态角色或条件。我们提出二维分析框架:一轴捕捉高度极化议题(如堕胎、移民)中的党派倾向,另一轴评估低极化议题(如气候变化、外交政策)中的社会政治参与度。基于美国国家选举研究(ANES)和皮尤研究中心的调查式提示词,我们分析了来自美国、欧洲、中国和中东地区开发的43个LLMs的响应。通过提出熵加权偏差分数来量化党派倾向的方向性与一致性,并借助参与度特征识别出四种行为聚类。研究发现:多数模型呈现中左或左翼意识形态倾向,其无党派参与模式存在显著差异;模型规模与开源程度并非行为强预测因子,表明对齐策略和制度环境对政治表达的影响更具决定性。


RadioLLM: Introducing Large Language Model into Cognitive Radio via Hybrid Prompt and Token Reprogrammings

Abstract

arXiv:2501.17888v2 Announce Type: replace-cross Abstract: The growing scarcity of spectrum resources and rapid proliferation of wireless devices make efficient radio network management critical. While deep learning-enhanced Cognitive Radio Technology (CRT) provides promising solutions for tasks such as radio signal classification (RSC), denoising, and spectrum allocation, existing DL-based CRT frameworks are typically task-specific and lack scalability in diverse real-world applications. This limitation naturally leads to the exploration of Large Language Models (LLMs), whose exceptional cross-domain generalization capabilities offer new potential for advancing CRT. To bridge this gap, we propose RadioLLM, a novel framework that integrates Hybrid Prompt and Token Reprogramming (HPTR) for combining radio signal features with expert knowledge, and a Frequency-Attuned Fusion (FAF) module for enhanced high-frequency feature modeling. Extensive evaluations on multiple benchmark datasets demonstrate that RadioLLM achieves superior performance compared to existing baselines in the majority of testing scenarios.

摘要

随着频谱资源日益紧张和无线设备数量激增,高效的无线电网络管理变得至关重要。尽管深度学习增强的认知无线电技术(CRT)为无线电信号分类(RSC)、去噪和频谱分配等任务提供了有前景的解决方案,但现有基于深度学习的CRT框架通常仅针对特定任务设计,在多样化实际应用中缺乏可扩展性。这一局限性自然引发了对大语言模型(LLMs)的探索,其卓越的跨领域泛化能力为推进CRT技术带来了新的可能性。为填补这一空白,我们提出RadioLLM框架,该框架通过混合提示与令牌重编程(HPTR)技术将无线电信号特征与专家知识相结合,并采用频率适配融合(FAF)模块来增强高频特征建模能力。在多组基准数据集上的大量实验表明,RadioLLM在大多数测试场景中均优于现有基线方法,展现出卓越性能。


The Devil Is in the Details: Tackling Unimodal Spurious Correlations for Generalizable Multimodal Reward Models

Abstract

arXiv:2503.03122v3 Announce Type: replace-cross Abstract: Multimodal Reward Models (MM-RMs) are crucial for aligning Large Language Models (LLMs) with human preferences, particularly as LLMs increasingly interact with multimodal data. However, we find that MM-RMs trained on existing datasets often struggle to generalize to out-of-distribution data due to their reliance on unimodal spurious correlations, primarily text-only shortcuts within the training distribution, which prevents them from leveraging true multimodal reward functions. To address this, we introduce a Shortcut-aware MM-RM learning algorithm that mitigates this issue by dynamically reweighting training samples, shifting the distribution toward better multimodal understanding, and reducing dependence on unimodal spurious correlations. Our experiments demonstrate significant improvements in generalization, downstream task performance, and scalability, establishing a more robust framework for multimodal reward modeling.

摘要

多模态奖励模型(MM-RMs)对于将大语言模型(LLMs)与人类偏好对齐至关重要,尤其是在LLMs日益频繁地处理多模态数据时。然而,我们发现基于现有数据集训练的MM-RMs由于依赖单模态伪相关性(主要是训练分布中仅含文本的捷径),往往难以泛化至分布外数据,导致其无法有效利用真正的多模态奖励函数。为解决这一问题,我们提出了一种"捷径感知"的MM-RMs学习算法,该算法通过动态调整训练样本权重、将数据分布转向更优的多模态理解,并降低对单模态伪相关性的依赖,从而有效缓解上述缺陷。实验结果表明,该方法在泛化能力、下游任务表现和可扩展性方面均有显著提升,为多模态奖励建模建立了更鲁棒的框架。


Abstract

arXiv:2503.01921v2 Announce Type: replace-cross Abstract: SemEval-2025 Task 3 (Mu-SHROOM) focuses on detecting hallucinations in content generated by various large language models (LLMs) across multiple languages. This task involves not only identifying the presence of hallucinations but also pinpointing their specific occurrences. To tackle this challenge, this study introduces two methods: modified RefChecker and modified SelfCheckGPT. The modified RefChecker integrates prompt-based factual verification into References, structuring them as claim-based tests rather than single external knowledge sources. The modified SelfCheckGPT incorporates external knowledge to overcome its reliance on internal knowledge. In addition, both methods' original prompt designs are enhanced to identify hallucinated words within LLM-generated texts. Experimental results demonstrate the effectiveness of the approach, achieving a high ranking on the test dataset in detecting hallucinations across various languages, with an average IoU of 0.5310 and an average COR of 0.5669.

摘要

SemEval-2025任务3(Mu-SHROOM)致力于检测多种语言大模型(LLM)生成内容中的幻觉现象。该任务不仅需要识别幻觉的存在,还需精确定位其具体发生位置。为应对这一挑战,本研究提出两种改进方法:修正版RefChecker与修正版SelfCheckGPT。修正版RefChecker将基于提示的事实核查整合至参考文献中,将其构建为基于主张的测试框架而非单一外部知识源;修正版SelfCheckGPT则引入外部知识以克服其对内部知识的依赖。此外,两种方法的原始提示设计均经过优化,可识别LLM生成文本中的幻觉词汇。实验结果表明,该方法在多语言幻觉检测测试数据集上表现优异,平均交并比(IoU)达0.5310,平均正确率(COR)为0.5669,排名靠前。


SemViQA: A Semantic Question Answering System for Vietnamese Information Fact-Checking

Abstract

arXiv:2503.00955v2 Announce Type: replace-cross Abstract: The rise of misinformation, exacerbated by Large Language Models (LLMs) like GPT and Gemini, demands robust fact-checking solutions, especially for low-resource languages like Vietnamese. Existing methods struggle with semantic ambiguity, homonyms, and complex linguistic structures, often trading accuracy for efficiency. We introduce SemViQA, a novel Vietnamese fact-checking framework integrating Semantic-based Evidence Retrieval (SER) and Two-step Verdict Classification (TVC). Our approach balances precision and speed, achieving state-of-the-art results with 78.97% strict accuracy on ISE-DSC01 and 80.82% on ViWikiFC, securing 1st place in the UIT Data Science Challenge. Additionally, SemViQA Faster improves inference speed 7x while maintaining competitive accuracy. SemViQA sets a new benchmark for Vietnamese fact verification, advancing the fight against misinformation. The source code is available at: https://github.com/DAVID-NGUYEN-S16/SemViQA.

摘要

随着GPT和Gemini等大型语言模型(LLMs)的兴起,虚假信息问题日益严重,这要求我们为越南语等低资源语言提供强大的事实核查解决方案。现有方法在处理语义模糊性、同音异义词和复杂语言结构时往往捉襟见肘,常常以牺牲准确性为代价换取效率。我们提出了SemViQA,一种新颖的越南语事实核查框架,集成了基于语义的证据检索(SER)和两步裁决分类(TVC)。我们的方法在精度和速度之间取得了平衡,在ISE-DSC01数据集上达到了78.97%的严格准确率,在ViWikiFC数据集上达到了80.82%的准确率,并在UIT数据科学挑战赛中获得了第一名。此外,SemViQA Faster在保持竞争力的准确率的同时,将推理速度提高了7倍。SemViQA为越南语事实核查设立了新的基准,推动了打击虚假信息的进程。


Abstract

arXiv:2503.21098v2 Announce Type: replace-cross Abstract: Generative retrieval (GR) has revolutionized document retrieval with the advent of large language models (LLMs), and LLM-based GR is gradually being adopted by the industry. Despite its remarkable advantages and potential, LLM-based GR suffers from hallucination and generates documents that are irrelevant to the query in some instances, severely challenging its credibility in practical applications. We thereby propose an optimized GR framework designed to alleviate retrieval hallucination, which integrates knowledge distillation reasoning in model training and incorporate decision agent to further improve retrieval precision. Specifically, we employ LLMs to assess and reason GR retrieved query-document (q-d) pairs, and then distill the reasoning data as transferred knowledge to the GR model. Moreover, we utilize a decision agent as post-processing to extend the GR retrieved documents through retrieval model and select the most relevant ones from multi perspectives as the final generative retrieval result. Extensive offline experiments on real-world datasets and online A/B tests on Fund Search and Insurance Search in Alipay demonstrate our framework's superiority and effectiveness in improving search quality and conversion gains.

摘要

生成式检索(GR)借助大语言模型(LLM)的兴起彻底改变了文档检索方式,基于LLM的GR技术正逐步被工业界采用。尽管具有显著优势和潜力,但基于LLM的GR存在幻觉问题,在某些情况下会生成与查询无关的文档,严重影响了其在实际应用中的可信度。为此,我们提出了一种优化的GR框架以缓解检索幻觉,该框架在模型训练中融合知识蒸馏推理,并引入决策代理以进一步提升检索精度。具体而言,我们利用LLM对GR检索到的查询-文档(q-d)对进行评估推理,随后将推理数据作为迁移知识蒸馏至GR模型。此外,我们采用决策代理作为后处理模块,通过检索模型扩展GR检索结果,并从多维度筛选最相关文档作为最终生成式检索结果。在真实数据集上的大量离线实验,以及在支付宝基金搜索和保险搜索场景下的在线A/B测试,均证明了该框架在提升搜索质量和转化收益方面的优越性与有效性。


Understanding Learner-LLM Chatbot Interactions and the Impact of Prompting Guidelines

Abstract

arXiv:2504.07840v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have transformed human-computer interaction by enabling natural language-based communication with AI-powered chatbots. These models are designed to be intuitive and user-friendly, allowing users to articulate requests with minimal effort. However, despite their accessibility, studies reveal that users often struggle with effective prompting, resulting in inefficient responses. Existing research has highlighted both the limitations of LLMs in interpreting vague or poorly structured prompts and the difficulties users face in crafting precise queries. This study investigates learner-AI interactions through an educational experiment in which participants receive structured guidance on effective prompting. We introduce and compare three types of prompting guidelines: a task-specific framework developed through a structured methodology and two baseline approaches. To assess user behavior and prompting efficacy, we analyze a dataset of 642 interactions from 107 users. Using Von NeuMidas, an extended pragmatic annotation schema for LLM interaction analysis, we categorize common prompting errors and identify recurring behavioral patterns. We then evaluate the impact of different guidelines by examining changes in user behavior, adherence to prompting strategies, and the overall quality of AI-generated responses. Our findings provide a deeper understanding of how users engage with LLMs and the role of structured prompting guidance in enhancing AI-assisted communication. By comparing different instructional frameworks, we offer insights into more effective approaches for improving user competency in AI interactions, with implications for AI literacy, chatbot usability, and the design of more responsive AI systems.

摘要

大型语言模型(LLMs)通过实现与AI驱动聊天机器人的自然语言交互,彻底改变了人机互动方式。这些模型设计直观且用户友好,使用户能够以最小努力表达需求。然而研究表明,尽管其易用性,用户仍常面临有效提示的困难,导致低效响应。现有研究既揭示了LLMs在解释模糊或结构不良提示时的局限性,也指出了用户构建精确查询的困境。本研究通过教育实验调查学习者与AI的互动,参与者接受了结构化提示指导。我们提出并比较三种提示指南:采用结构化方法开发的任务特定框架与两种基线方法。为评估用户行为和提示效能,我们分析了来自107位用户的642次交互数据集。通过扩展的语用标注框架Von NeuMidas,我们对常见提示错误进行分类并识别重复行为模式。随后通过考察用户行为变化、提示策略遵循度及AI生成响应质量,评估不同指南的影响。研究发现深化了关于用户如何与LLMs互动的理解,并揭示了结构化提示指导对提升AI辅助交流的作用。通过比较不同教学框架,我们为提升用户AI交互能力提供了更有效的方法见解,这对AI素养、聊天机器人可用性以及响应式AI系统设计具有重要启示。


DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs

Abstract

arXiv:2504.17040v2 Announce Type: replace-cross Abstract: We present DyMU, an efficient, training-free framework that dynamically reduces the computational burden of vision-language models (VLMs) while maintaining high task performance. Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity, addressing the inherent inefficiency of fixed-length outputs in vision transformers. Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence, thus preserving the downstream performance without additional fine-tuning. Unlike previous approaches, our method dynamically adapts token compression to the content of the image and operates completely training-free, making it readily applicable to most state-of-the-art VLM architectures. Extensive experiments on image and video understanding tasks demonstrate that DyMU can reduce the average visual token count by 32%-85% while achieving comparable performance to full-length models across diverse VLM architectures, including the recently popularized AnyRes-based visual encoders. Furthermore, through qualitative analyses, we demonstrate that DToMe effectively adapts token reduction based on image complexity and, unlike existing systems, provides users more control over computational costs. Project page: https://mikewangwzhl.github.io/dymu/.

摘要

我们提出DyMU,一种高效、无需训练的动态框架,能够显著降低视觉语言模型(VLM)的计算负担,同时保持高任务性能。该方法包含两个关键组件:首先,动态令牌合并(DToMe)通过基于图像复杂度合并相似令牌,减少视觉令牌嵌入数量,从而解决视觉Transformer中固定长度输出固有的效率问题;其次,虚拟令牌解合并(VTU)通过高效重构完整序列的注意力动态,模拟大型语言模型(LLM)预期的令牌序列,从而在不进行额外微调的情况下保持下游性能。与现有方法不同,我们的技术能根据图像内容动态调整令牌压缩比例,且完全无需训练,可轻松应用于大多数前沿VLM架构。在图像和视频理解任务上的大量实验表明,DyMU能将平均视觉令牌数量减少32%-85%,同时在包括近期流行的基于AnyRes视觉编码器在内的多种VLM架构上,实现与完整序列模型相当的性能。此外,定性分析表明DToMe能根据图像复杂度有效调整令牌压缩比例,与现有系统不同,该方法为用户提供了对计算成本的更强控制能力。项目页面:https://mikewangwzhl.github.io/dymu/。


Tiled Flash Linear Attention: More Efficient Linear RNN and xLSTM Kernels

Abstract

arXiv:2503.14376v2 Announce Type: replace-cross Abstract: Linear RNNs with gating recently demonstrated competitive performance compared to Transformers in language modeling. Although their linear compute scaling in sequence length offers theoretical runtime advantages over Transformers, realizing these benefits in practice requires optimized custom kernels, as Transformers rely on the highly efficient Flash Attention kernels (Dao, 2024). Leveraging the chunkwise-parallel formulation of linear RNNs, Flash Linear Attention (FLA) (Yang & Zhang, 2024) shows that linear RNN kernels are faster than Flash Attention, by parallelizing over chunks of the input sequence. However, since the chunk size of FLA is limited, many intermediate states must be materialized in GPU memory. This leads to low arithmetic intensity and causes high memory consumption and IO cost, especially for long-context pre-training. In this work, we present Tiled Flash Linear Attention (TFLA), a novel kernel algorithm for linear RNNs, that enables arbitrary large chunk sizes and high arithmetic intensity by introducing an additional level of sequence parallelization within each chunk. First, we apply TFLA to the xLSTM with matrix memory, the mLSTM (Beck et al., 2024). Second, we propose an mLSTM variant with sigmoid input gate and reduced computation for even faster kernel runtimes at equal language modeling performance. In our speed benchmarks, we show that our new mLSTM kernels based on TFLA outperform highly optimized Flash Attention, Linear Attention and Mamba kernels, setting a new state of the art for efficient long-context sequence modeling primitives.

摘要

带有门控机制的线性RNN近期在语言建模任务中展现出与Transformer相媲美的性能。尽管其序列长度的线性计算复杂度在理论上比Transformer更具运行时优势,但要在实践中实现这些优势需要优化定制内核,因为Transformer依赖于高度优化的Flash Attention内核(Dao,2024)。基于线性RNN的分块并行计算特性,Flash Linear Attention(FLA)(Yang & Zhang,2024)通过并行处理输入序列块,证明了线性RNN内核比Flash Attention更快。然而由于FLA的块大小受限,必须将大量中间状态存储在GPU内存中,导致算术强度降低并引发高内存消耗与IO开销,这对长上下文预训练尤为明显。本研究提出分块闪存线性注意力(TFLA)——一种创新的线性RNN内核算法,通过在每块内部引入额外的序列并行化层级,实现任意大分块尺寸与高算术强度。首先,我们将TFLA应用于具有矩阵记忆的xLSTM(即mLSTM,Beck等人,2024)。其次,我们提出采用sigmoid输入门并减少计算的mLSTM变体,在保持同等语言建模性能的同时获得更快的内核运行速度。速度基准测试表明,基于TFLA的新型mLSTM内核在性能上超越高度优化的Flash Attention、Linear Attention和Mamba内核,为高效长上下文序列建模原语树立了新的技术标杆。


Rewriting Pre-Training Data Boosts LLM Performance in Math and Code

Abstract

arXiv:2505.02881v2 Announce Type: replace-cross Abstract: The performance of large language models (LLMs) in program synthesis and mathematical reasoning is fundamentally limited by the quality of their pre-training corpora. We introduce two openly licensed datasets, released under the Llama 3.3 Community License, that significantly enhance LLM performance by systematically rewriting public data. SwallowCode (approximately 16.1 billion tokens) refines Python snippets from The-Stack-v2 through a novel four-stage pipeline: syntax validation, pylint-based style filtering, and a two-stage LLM rewriting process that enforces style conformity and transforms snippets into self-contained, algorithmically efficient examples. Unlike prior methods that rely on exclusionary filtering or limited transformations, our transform-and-retain approach upgrades low-quality code, maximizing data utility. SwallowMath (approximately 2.3 billion tokens) enhances Finemath-4+ by removing boilerplate, restoring context, and reformatting solutions into concise, step-by-step explanations. Within a fixed 50 billion token training budget, continual pre-training of Llama-3.1-8B with SwallowCode boosts pass@1 by +17.0 on HumanEval and +17.7 on HumanEval+ compared to Stack-Edu, surpassing the baseline model's code generation capabilities. Similarly, substituting SwallowMath yields +12.4 accuracy on GSM8K and +7.6 on MATH. Ablation studies confirm that each pipeline stage contributes incrementally, with rewriting delivering the largest gains. All datasets, prompts, and checkpoints are publicly available, enabling reproducible research and advancing LLM pre-training for specialized domains.

摘要

大型语言模型(LLM)在程序合成与数学推理方面的性能根本上受限于其预训练语料的质量。我们引入两个基于Llama 3.3社区许可证公开授权的数据集,通过系统性重写公开数据显著提升LLM性能。SwallowCode(约161亿token)采用新颖的四阶段流程改进The-Stack-v2中的Python代码片段:语法验证、基于pylint的风格过滤,以及两阶段LLM重写过程——强制风格一致性并将代码片段转化为自包含的算法高效示例。与依赖排他性过滤或有限转换的现有方法不同,我们的"转换-保留"方法能提升低质量代码,最大化数据效用。SwallowMath(约23亿token)通过去除样板文本、还原上下文、将解决方案重构为简洁的逐步解释来增强Finemath-4+。在固定500亿token训练预算下,使用SwallowCode对Llama-3.1-8B进行持续预训练,相较Stack-Edu在HumanEval上pass@1提升+17.0,HumanEval+提升+17.7,超越基线模型的代码生成能力。类似地,采用SwallowMath使GSM8K准确率提升+12.4,MATH提升+7.6。消融研究证实每个流程阶段均具有增量贡献,其中重写阶段收益最大。所有数据集、提示词和检查点均已公开,可促进可复现研究并推动专业领域LLM预训练发展。